MIT - Applied Parallel Computing - Alan Edelman
Math 18.337, Computer Science 6.338, SMA 5505
Applied Parallel Computing
Spring 2004
Contents

1 Introduction
  1.1 The machines
  1.2 The software
  1.3 The Reality of High Performance Computing
  1.4 Modern Algorithms
  1.5 Compilers
  1.6 Scientific Algorithms
  1.7 History, State-of-Art, and Perspective
    1.7.1 Things that are not traditional supercomputers
  1.8 Analyzing the top500 List Using Excel
    1.8.1 Importing the XML file
    1.8.2 Filtering
    1.8.3 Pivot Tables
  1.9 Parallel Computing: An Example
  1.10 Exercises

3 Parallel Prefix
  3.1 Parallel Prefix
  3.2 The “Myth” of lg n
  3.3 Applications of Parallel Prefix
    3.3.1 Segmented Scan

6 Parallel Machines
  6.0.1 More on private versus shared addressing
  6.0.2 Programming Model
  6.0.3 Machine Topology
  6.0.4 Homogeneous and heterogeneous machines
  6.0.5 Distributed Computing on the Internet and Akamai Network

7 FFT
  7.1 FFT
    7.1.1 Data motion
    7.1.2 FFT on parallel machines
    7.1.3 Exercises
  7.2 Matrix Multiplication
  7.3 Basic Data Communication Operations
Introduction
This book strives to study that elusive mix of theory and practice so important for understanding
modern high performance computing. We try to cover it all, from the engineering aspects of
computer science (parallel architectures, vendors, parallel languages, and tools) to the mathematical
understanding of numerical algorithms, and also certain aspects from theoretical computer science.
Any understanding of the subject should begin with a quick introduction to the current scene in
terms of machines, vendors, and performance trends. This sets a realistic expectation of maximal
performance and an idea of the monetary price. Then one must quickly introduce at least one,
though possibly two or three, software approaches so that no time is wasted in getting onto a
computer. Then one has the luxury of reviewing interesting algorithms, understanding the intricacies
of how architecture influences performance and how this has changed historically, and obtaining
detailed knowledge of software techniques.
the keyboard and mouse are idle. Should Condor detect that a machine is no longer
available (such as a key press detected), in many circumstances Condor is able to trans-
parently produce a checkpoint and migrate a job to a different machine which would
otherwise be idle. (http://www.cs.wisc.edu/condor/description.html)
Condor can run parallel computations across multiple Condor nodes using PVM or MPI, but (for
now) using MPI requires dedicated nodes that cannot be used as desktop machines.
1.5 Compilers
[move to a software section?]
Most parallel languages are defined by adding parallel extensions to well-established sequential
languages, such as C and Fortran. Such extensions allow the user to specify various levels of parallelism
in an application program, to define and manipulate parallel data structures, and to specify message
passing among parallel computing units.
Compilation has become more important for parallel systems. The purpose of a compiler is
not just to transform a parallel program in a high-level description into machine-level code. The
role of compiler optimization is more important. There is a perennial debate: are we developing
methods and algorithms for a parallel machine, or are we designing parallel machines for algorithms
and applications? Compilers are meant to bridge the gap between algorithm design and
machine architectures. Extending compiler techniques to run-time libraries will further reduce the
user's burden in parallel programming.
Software libraries are important tools in the use of computers. The purpose of libraries is
to enhance productivity by providing preprogrammed functions and procedures. Software
libraries provide even higher level support to programmers than high-level languages. Parallel
scientific libraries embody expert knowledge of computer architectures, compilers, operating sys-
tems, numerical analysis, parallel data structures, and algorithms. They systematically choose a
set of basic functions and parallel data structures, and provide highly optimized routines for these
functions that carefully consider the issues of data allocation, data motion, load balancing, and
numerical stability. Hence scientists and engineers can spend more time and be more focused on
developing efficient computational methods for their problems. Another goal of scientific libraries
is to make parallel programs portable from one parallel machine platform to another. Because of
the lack, until very recently, of non-proprietary parallel programming standards, the development
of portable parallel libraries has lagged far behind the need for them. There is good evidence now,
however, that scientific libraries will become more powerful in the future and will include more
functions that provide a better interface to real applications.
Due to the generality of scientific libraries, their functions may be more complex than needed
for a particular application. Hence, they are less efficient than the best codes that take advantage of
the special structure of the problem. So a programmer needs to learn how to use scientific libraries.
A pragmatic suggestion is to use functions available in the scientific library to develop the first
prototype and then to iteratively find the bottleneck in the program and improve the efficiency.
• distributed.net works on RSA Laboratories’ RC5 cipher decryption contest and also
searches for optimal Golomb rulers.
Whereas the above are examples of benevolent or harmless distributed computing, there are also
other sorts of distributed computing which are frowned upon, either by the entertainment industry
in the first example below, or universally in the latter two.
• Peer-to-peer file-sharing (the original Napster, followed by Gnutella, KaZaA, Freenet, and the
like) can be viewed as a large distributed supercomputer, although the resource being parallelized
is storage rather than processing. KaZaA itself is notable because the client software contains
an infamous hook (called Altnet) which allows a KaZaA corporate partner (Brilliant Digital
Entertainment) to load and run arbitrary code on the client computer. Brilliant Digital
has been quoted as saying they plan to use Altnet as “the next advancement in distributed
bandwidth, storage and computing” (Brilliant Digital Entertainment’s Annual Report, Form
10-KSB, filed with the SEC April 1, 2002). Altnet has so far only been used to distribute ads to
KaZaA users.
• Distributed Denial of Service (DDoS) attacks harness thousands of machines (typically com-
promised through a security vulnerability) to attack an Internet host with so much traffic
that it becomes too slow or otherwise unusable by legitimate users.
• Spam e-mailers have also increasingly turned to hundreds or thousands of compromised machines
to act as remailers, as a response to the practice of “blacklisting” machines thought to be
spam mailers. A disturbing mode of attack is an e-mail message containing a trojan
attachment, which when executed opens a backdoor that the spammer can use to send more
spam.
Although the class will not focus on these non-traditional supercomputers, the issues they have to
deal with (communication between processors, load balancing, dealing with unreliable nodes) are
similar to the issues that will be addressed in this class.
Click Open. A small window pops up asking how you would like to open the file. Leave the
default as is and click OK (Figure 1.2).
1.8.2 Filtering
Click OK again. Something as in Figure 1.3 should show up.
If you don’t see the bolded column titles with the arrows next to them, go to the menubar
and select Data → Filter → Autofilter (make sure it is checked). Type Ctrl-A to highlight all the
entries. Go to the menubar and select Format → Column → Width. You can put in any width
you’d like, but 10 would work.
You can now filter the data in numerous ways. For example, if you want to find out all SGI
installations in the US, you can click on the arrow in column K (country) and select United States.
Then, click on the arrow in column F (manufacturer) and select SGI. The entire data set is now
filtered down to just those machines (Figure 1.5).
If you continue to click on other arrows, the selection will become more and more filtered. If
you would like to start over, go to the menubar and select Data → Filter → Show All. Assuming
that you’ve started over with all the data, we can try to see all machines in both Singapore and
Malaysia. Click on the arrow in column K (country) again and select (Custom...). You should see
something as in Figure 1.6.
On the top right side, pull down the arrow and select Singapore. On the lower left, pull down
the arrow and select equals. On the lower right, pull down the arrow and select Malaysia (Figure 1.6).
Now click OK. You should see a blank filtered list because there are no machines that are in
both Singapore and Malaysia (Figure 1.7).
If you want to find out all machines in Singapore or India, you have to start off with all the
data again. You perform the same thing as before except that in the Custom AutoFilter screen,
you should click on the Or toggle button. You should also type in India instead of Malaysia (Figure 1.8).
Click OK and you should see something as in Figure 1.9.
Now scroll down the PivotTable Field List again, find ns1:computer, and use your mouse to drag
it to where it says “Drop Data Items Here” (Figure 1.12).
Notice that at the top left of the table it says “Count of ns1:computer”. This means that it is
counting up the number of machines for each individual country. Now let’s find the highest rank of
each machine. Click on the gray ns1:country column heading and drag it back to the PivotTable
Field List. Now from the PivotTable Field List, drag ns1:computer to the column where the country
list was before. Click on the gray Count of ns1:computer column heading and drag it back to the
PivotTable Field List. From the same list, drag ns1:rank to where the ns1:computer information
was before (it says “Drop Data Items Here”). The upper left column heading should now be: Sum
of ns1:rank. Double-click on it and the resulting pop-up is shown in Figure 1.13.
To find the highest rank of each machine, we need the data to be summarized by Min. Click on Min
and click OK (Figure 1.14).
The data now shows the highest rank for each machine. Let’s find the number of machines
worldwide for each vendor. Following the same procedures as before, we replace the Row Fields
and Data Items with ns1:manufacturer and ns1:computer, respectively. We see that the upper left
column heading says “Count of ns1:computer”, so we know it is counting the machines for
each vendor. The screen should look like Figure 1.15.
Now let’s find the minimum rank for each vendor. By now we should all be experts at this!
The tricky part here is that we actually want the Max of the ranks. The screen you should get is
shown in Figure 1.16.
Note that it is also possible to create 3D pivot tables. Start out with a blank slate by dragging
everything back to the Pivot Table Field List. Then, from that list, drag ns1:country to where
it says “Drop Row Fields Here”. Drag ns1:manufacturer to where it says “Drop Column Fields
Here”. Now click on the arrow under the ns1:country title, click on Show All (which clears all the
checks), then check Australia, Austria, Belarus, and Belgium. Click on the arrow under the ns1:manufacturer
title, click on Show All, then check Dell, Hewlett-Packard, and IBM. Drag ns1:rank from the Pivot Table
Field List to where it says “Drop Data Items Here”. Double-click on the Sum of ns1:rank and
select Count instead of Sum. Finally, drag ns1:year to where it says “Drop Page Fields Here”. You
have successfully completed a 3D pivot table that looks like Figure 1.17.
As you can imagine, there are infinite possibilities for auto-filter and pivot table combinations.
Have fun with it!
\[
x_1 + x_2 + \cdots + x_P ,
\]
where xi is a floating point number and, for purposes of exposition, let us assume P is a power of 2.
Obviously, the sum can be computed in log P (base 2 is understood) steps, simply by adding
neighboring pairs recursively. (The algorithm has also been called pairwise summation and cascade
summation.) The data flow in this computation can be thought of as a binary tree.
(Illustrate on tree.)
Nodes represent values, either input or the result of a computation. Edges communicate values
from their definition to their uses.
This is an example of what is often known as a reduce operation. We can replace the addition
operation with any associative operator to generalize. (Actually, floating point addition is not
associative, leading to interesting numerical questions about the best order in which to add numbers.
This is studied by Higham [49, See p. 788].)
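In serial code the same binary-tree data flow can be written as a recursion; a minimal sketch of our own (with ⊕ specialized to floating point addition):

/* Pairwise (cascade) summation of x[lo..hi): recursively add the two halves.
   The recursion tree is exactly the binary tree described above, so with one
   processor per leaf the depth is log2(P). */
double pairwise_sum(const double *x, int lo, int hi)
{
    if (hi - lo == 1)
        return x[lo];
    int mid = lo + (hi - lo) / 2;
    return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid, hi);
}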
There are so many questions though. How would you write a program to do this on P processors?
Is it likely that you would want to have such a program on P processors? How would the data (the
xi ) get to these processors in the first place? Would they be stored there or result from some other
computation? It is far more likely that you would have 10000P numbers on P processors to add.
A correct parallel program is often not an efficient parallel program. A flaw in a parallel program
that causes it to get the right answer slowly is known as a performance bug. Many beginners will
write a working parallel program, obtain poor performance, and prematurely conclude that either
parallelism is a bad idea, or that it is the machine that they are using which is slow.
What are the sources of performance bugs? We illustrate some of them with this little, admit-
tedly contrived example. For this example, imagine four processors, numbered zero through three,
each with its own private memory, and able to send and receive messages to/from the others. As a
simple approximation, assume that the time consumed by a message of size n words is A + Bn.
the other processors for proper load balancing. The results are added up on each
processor, and then processor 0 adds the final four numbers together.
True, there is a tiny bit of load imbalance here, since processor zero does those few
last additions. But that is nothing compared with the cost it incurs in shipping out
the data to the other processors. In order to get that data out onto the network,
it incurs a large cost that does not drop with the addition of more processors. (In
fact, since the number of messages it sends grows like the number of processors,
the time spent in the initial communication will actually increase.)
3. A sequential bottleneck. Let’s assume the data are initially spread out among
the processors; processor zero has the numbers 1 through 250,000, etc. Assume
that the owner of i will add it to the running total. So there will be no load
imbalance. But assume, further, that we are constrained to add the numbers
in their original order! (Sounds silly when adding, but other algorithms require
exactly such a constraint.) Thus, processor one may not begin its work until it
receives the sum 0 + 1 + · · · + 250,000 from processor zero!
We are thus requiring a sequential computation: our binary summation tree is
maximally unbalanced, and has height 10^6. It is always useful to know the critical
path—the length of the longest path in the dataflow graph—of the computation
being parallelized. If it is excessive, change the algorithm!
Some problems, such as the Fibonacci recurrence Fk+1 = Fk + Fk−1, look sequential,
but looks can be deceiving. Parallel prefix, which will be introduced later, can be used to
parallelize the Fibonacci computation.
1.10 Exercises
1. Compute the sum of 1 through 1,000,000 using HPF. This amounts to a “hello world” pro-
gram on whichever machine you are using. We are not currently aware of any free distributions
of HPF for workstations, so your instructor will have to suggest a computer to use.
2. Download MPI to your machine and compute the sum of 1 through 1,000,000 us-
ing C or Fortran with MPI. A number of MPI implementations may be found at
http://www-unix.mcs.anl.gov/mpi/
3. In HPF, generate 1,000,000 real random numbers and sort them. (Use RANDOM_NUMBER and
GRADE_UP.)
5. Set up the excessive communication situation described as the second bad parallel algorithm.
Place the numbers 1 through one million in a vector on one processor. Using four processors
see how quickly you can get the sum of the numbers 1 through one million.
Lecture 2
A parallel language must provide mechanisms for implementing parallel algorithms, i.e., to spec-
ify various levels of parallelism and define parallel data structures for distributing and sharing
information among processors.
Most current parallel languages add parallel constructs to standard sequential languages. Different
parallel languages provide different basic constructs. The choice largely depends on the
parallel computing model the language means to support.
There are at least three basic parallel computation models other than the vector and pipeline models:
data parallel, message passing, and shared memory task parallel.
sync;
return x+y;
}
}
Actually Cilk is a little bit different right now, but this is the way the program will look when
you read these notes. The whole point is that the two computations of fib(n-1) and fib(n-2) can
be executed in parallel. As you might have expected, there are dozens of multithreaded languages
(functional, imperative, declarative) and implementation techniques; in some implementations the
thread size is a single instruction, and special processors execute this kind of program. Ask
Arvind at LCS for details.
Writing a good HPF compiler is difficult and not every manufacturer provides one; actually for
some time TMC machines were the only machines available with it. The first HPF compiler for
the Intel Paragon dates from December 1994.
Why is SIMD not necessarily the right model for data parallel programming? Consider the
following Fortran fragment, where x and y are vectors:
where (x > 0)
y = x + 2
elsewhere
y = -x + 5
endwhere
A SIMD machine might execute both cases, and discard one of the results; it does twice the
needed work (see why? there is a single flow of instructions). This is how the CM-2 operated.
On the other hand, an HPF compiler for a SIMD machine can take advantage of the fact that
there will be many more elements of x than processors. It can execute the where branch until there
are no positive elements of x that haven’t been seen, then it can execute the elsewhere branch
until all other elements of x are covered. It can do this provided the machine has the ability to
generate independent memory addresses on every processor.
Moral: even if the programming model is that there is one processor per data element, the
programmer (and the compiler writer) must be aware that it’s not true.
Many highly parallel machines have been, and still are, just collections of independent computers
on some sort of a network. Such machines can be made to have just about any data sharing and
synchronization mechanism; it just depends on what software is provided by the operating system,
the compilers, and the runtime libraries. One possibility, the one used by the first of these machines
(The Caltech Cosmic Cube, from around 1984) is message passing. (So it’s misleading to call these
“message passing machines”; they are really multicomputers with message passing library software.)
From the point of view of the application, these computers can send a message to another
computer and can receive such messages off the network. Thus, a process cannot touch any data
other than what is in its own, private memory. The way it communicates is to send messages to
and receive messages from other processes. Synchronization happens as part of the process, by
virtue of the fact that both the sending and receiving process have to make a call to the system in
order to move the data: the sender won’t call send until its data is already in the send buffer, and
the receiver calls receive when its receive buffer is empty and it needs more data to proceed.
Message passing systems have been around since the Cosmic Cube, about ten years. In that
time, there has been a lot of evolution, improved efficiency, better software engineering, improved
functionality. Many variants were developed by users, computer vendors, and independent software
companies. Finally, in 1993, a standardization effort was attempted, and the result is the Message
Passing Interface (MPI) standard. MPI is flexible and general, has good implementations on all
the machines one is likely to use, and is almost certain to be around for quite some time. We’ll
use MPI in the course. On one hand, MPI is complicated considering that there are more than 150
functions and the number is still growing. But on the other hand, MPI is simple because there are
only six basic functions: MPI_Init, MPI_Finalize, MPI_Comm_rank, MPI_Comm_size, MPI_Send and
MPI_Recv.
In print, the best MPI reference is the handbook Using MPI, by William Gropp, Ewing Lusk,
and Anthony Skjellum, published by MIT Press ISBN 0-262-57104-8.
The standard is on the World Wide Web. The URL is
http://www-unix.mcs.anl.gov/mpi/
2.2.1 Who am I?
On the SP-2 and other multicomputers, one usually writes one program which runs on all the
processors. In order to differentiate its behavior (like producer and consumer), a process usually
first finds out at runtime its rank within its process group, then branches accordingly. The call
MPI_Comm_size(comm, &size) sets size to the number of processes in the group specified by comm,
and the call MPI_Comm_rank(comm, &rank) sets rank to the rank of the calling process within the
group (from 0 up to n − 1, where n is the size of the group). Usually, the first thing a program does
is to call these using MPI_COMM_WORLD as the communicator, in order to find out the answer to the
big questions, “Who am I?” and “How many other ‘I’s are there?”.
Okay, I lied. That’s the second thing a program does. Before it can do anything else, it has to
make the call MPI_Init(&argc, &argv), where argc and argv should be pointers to the arguments
provided by UNIX to main(). While
we’re at it, let’s not forget that one’s code needs to start with
#include "mpi.h"
The last thing the program does is to call MPI_Finalize(). No arguments.
Here’s an MPI multi-process “Hello World”:
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int myrank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    printf("Hello world! This is process %d out of %d\n", myrank, nprocs);
    if (myrank == 0) printf("Some processes are more equal than others.\n");
    MPI_Finalize();
    return 0;
} /* main */
Another important thing to know about is the MPI wall clock timer:
double MPI_Wtime()
which returns the time in seconds from some unspecified point in the past.
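A typical use, as a small illustrative fragment:

double t0, elapsed;

t0 = MPI_Wtime();
/* ... the code being timed ... */
elapsed = MPI_Wtime() - t0;   /* elapsed wall clock time, in seconds */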
The basic send call, MPI_Send(buf, count, datatype, dest, tag, comm), sends count items of
data of type datatype starting at the location buf. In all message passing
systems, the processes have identifiers of some kind. In MPI, the process is identified by its rank,
an integer. The data is sent to the process whose rank is dest. Possible values for datatype
are MPI_INT, MPI_DOUBLE, MPI_CHAR, etc. tag is an integer used by the programmer to allow the
receiver to select from among several arriving messages in the MPI_Recv. Finally, comm is something
called a communicator, which is essentially a subset of the processes. Ordinarily, message passing
occurs within a single subset. The subset MPI_COMM_WORLD consists of all the processes in a single
parallel job, and is predefined.
A receive call matching the send above is MPI_Recv(buf, count, datatype, source, tag, comm, &status).
Here buf is where the data is placed once it arrives. count, an input argument, is the size of the buffer;
the message is truncated if it is longer than the buffer. Of course, the receive has to be executed by
the correct destination process, as specified in the dest part of the send, for it to match. source
must be the rank of the sending process. The communicator and the tag must match. So must the
datatype.
The purpose of the datatype field is to allow MPI to be used with heterogeneous hardware.
A process running on a little-endian machine may communicate integers to another process on a
big-endian machine; MPI converts them automatically. The same holds for different floating point
formats. Type conversion, however, is not supported: an integer must be sent to an integer, a
double to a double, etc.
Suppose the producer and consumer transact business in two word integer packets. The pro-
ducer is process 0 and the consumer is process 1. Then the send would look like this:
int outgoing[2];
MPI_Send(outgoing, 2, MPI_INT, 1, 100, MPI_COMM_WORLD);
MPI_Status stat;
int incoming[2];
MPI_Recv(incoming, 2, MPI_INT, 0, 100, MPI_COMM_WORLD, &stat);
What if one wants a process to which several other processes can send messages, with service
provided on a first-arrived, first-served basis? For this purpose, we don’t want to specify the source
in our receive, and we use the value MPI_ANY_SOURCE instead of an explicit source. The same is
true if we want to ignore the tag: use MPI_ANY_TAG. The basic purpose of the status argument,
which is an output argument, is to find out what the tag and source of such a received message
are. status.MPI_TAG and status.MPI_SOURCE are int components of the struct status
that contain this information after the MPI_Recv function returns.
This form of send and receive is “blocking”, which is a technical term that has the following
meaning. For the send, it means that buf has been read by the system and the data has been
moved out as soon as the send returns. The sending process can then write into it without corrupting
the message that was sent. For the receive, it means that buf has been filled with data on return.
(A call to MPI_Recv with no corresponding call to MPI_Send occurring elsewhere is a very good and
often used method for hanging a message passing application.)
MPI implementations may use buffering to accomplish this. When send is called, the data are
copied into a system buffer and control returns to the caller. A separate system process (perhaps
using communication hardware) completes the job of sending the data to the receiver. Another
implementation is to wait until a corresponding receive is posted by the destination process, then
transfer the data to the receive buffer, and finally return control to the caller. MPI provides
two variants of send, MPI_Bsend and MPI_Ssend, that force the buffered or the rendezvous imple-
mentation. Lastly, there is a version of send that works in “ready” mode. For such a send, the
corresponding receive must have been executed previously, otherwise an error occurs. On some
systems, this may be faster than the blocking versions of send. All four versions of send have the
same calling sequence.
NOTE: MPI allows a process to send itself data. Don’t try it. On the SP-2, if the message is big
enough, it doesn’t work. Here’s why. Consider this code:
if (myrank == 0)
for(dest = 0; dest < size; dest++)
MPI_Send(sendbuf+dest*count, count, MPI_INT, dest, tag, MPI_COMM_WORLD);
MPI_Recv(recvbuf, count, MPI_INT, 0, tag, MPI_COMM_WORLD, &stat);
The programmer is attempting to send data from process zero to all processes, including process
zero; 4 · count bytes of it. If the system has enough buffer space for the outgoing messages, this
succeeds, but if it doesn’t, then the send blocks until the receive is executed. But since control
does not return from the blocking send to process zero, the receive never does execute. If the
programmer uses buffered send, then this deadlock cannot occur. An error will occur if the system
runs out of buffer space, however:
if (myrank == 0)
for(dest = 0; dest < size; dest++)
MPI_Bsend(sendbuf+dest*count, count, MPI_INT, dest, tag, MPI_COMM_WORLD);
MPI_Recv(recvbuf, count, MPI_INT, 0, tag, MPI_COMM_WORLD, &stat);
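The paragraph below discusses code of roughly the following shape; this is only a sketch, not the original listing (the tag values 100 and 101 appear in the text, while the buffer names, the message length, and the overall structure are assumptions; myrank and nprocs are set as in the earlier examples). Every process other than zero sends process zero one integer message and one float message, and process zero accepts each kind of message from any source:

int   ibuf[8], irecv[8];
float fbuf[8], frecv[8];
MPI_Status stat;
int i;

if (myrank != 0) {
    MPI_Send(ibuf, 8, MPI_INT,   0, 100, MPI_COMM_WORLD);   /* tag 100: integer data */
    MPI_Send(fbuf, 8, MPI_FLOAT, 0, 101, MPI_COMM_WORLD);   /* tag 101: float data   */
} else {
    for (i = 0; i < nprocs - 1; i++) {
        /* accept each kind of message from whichever process happens to send it */
        MPI_Recv(irecv, 8, MPI_INT,   MPI_ANY_SOURCE, 100, MPI_COMM_WORLD, &stat);
        MPI_Recv(frecv, 8, MPI_FLOAT, MPI_ANY_SOURCE, 101, MPI_COMM_WORLD, &stat);
    }
}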
It looks simple, but there are a lot of subtleties here! First, note the use of MPI_ANY_SOURCE in
the receives. We’re happy to receive the data in the order it arrives. Second, note that we use two
different tag values to distinguish between the int and the float data. Why isn’t the datatype field
enough? Because MPI does not include the type as part of the message “envelope”. The envelope
consists of the source, destination, tag, and communicator, and these must match in a send-receive
pair. Now the two messages sent to process zero from some other process are guaranteed to arrive
in the order they were sent, namely the integer message first. But that does not mean that all of
the integer messages precede all of the float messages! So the tag is needed to distinguish them.
This solution creates a problem. Our code, as it is now written, sends off a lot of messages
with tags 100 and 101, then does the receives (at process zero). Suppose we called a library routine
written by another user before we did the receives. What if that library code uses the same message
tags? Chaos results. We’ve “polluted” the tag space. Note, by the way, that synchronizing the
processes before calling the library code does not solve this problem.
MPI provides communicators as a way to prevent this problem. The communicator is a part
of the message envelope. So we need to change communicators while in the library routine. To do
this, we use MPI_Comm_dup, which makes a new communicator with the same processes in the same
order as an existing communicator. For example:
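A minimal sketch of the idea (the routine and variable names are ours; the point is only the MPI_Comm_dup / MPI_Comm_free pairing):

void library_routine(MPI_Comm comm)
{
    MPI_Comm libcomm;

    /* same processes, same ranks, but a private "envelope" for the library */
    MPI_Comm_dup(comm, &libcomm);

    /* ... all sends and receives inside the library use libcomm,
       so their tags cannot collide with the caller's tags ... */

    MPI_Comm_free(&libcomm);
}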
The messages sent and received inside the library code cannot interfere with those sent outside.
How well does a model of the form

Elapsed Time(n, r) = α + βn

work? (Here r is the receiver, and according to this model, the cost is receiver independent.)
In such a model, the latency for a message is α seconds, and the bandwidth is 1/β bytes/second.
Other models try to split α into two components. The first is the time actually spent by the sending
processor and the receiving processor on behalf of a message. (Some of the per-byte cost is also
attributed to the processors.) This is called the overhead. The remaining component of latency
is the delay as the message actually travels through the machine’s interconnect network. It is
ordinarily much smaller than the overhead on modern multicomputers (ones, rather than tens of
microseconds).
A lot has been made about the possibility of improving performance by “tolerating” communi-
cation latency. To do so, one finds other work for the processor to do while it waits for a message
to arrive. The simplest thing is for the programmer to do this explicitly. For this purpose, there
are “nonblocking” versions of send and receive in MPI and other dialects.
Nonblocking send and receive work this way. A nonblocking send start call initiates a send
but returns before the data are out of the send buffer. A separate call to send complete then
blocks the sending process, returning only when the data are out of the buffer. The same two-
phase protocol is used for nonblocking receive. The receive start call returns right away, and the
receive complete call returns only when the data are in the buffer.
The simplest mechanism is to match a nonblocking receive with a blocking send. To illustrate,
we perform the communication operation of the previous section using nonblocking receive.
MPI_Request request;
MPI_Irecv(recvbuf, count, MPI_INT, 0, tag, MPI_COMM_WORLD, &request);
if (myrank == 0)
for(dest = 0; dest < size; dest++)
MPI_Send(sendbuf+dest*count, count, MPI_INT, dest, tag, MPI_COMM_WORLD);
MPI_Wait(&request, &stat);
MPI_Wait blocks until the nonblocking operation identified by the handle request completes. This
code is correct regardless of the availability of buffers. The sends will either buffer or block until
the corresponding receive start is executed, and all of these will be.
Before embarking on an effort to improve performance this way, one should first consider what
the payoff will be. In general, the best that can be achieved is a two-fold improvement. Often, for
large problems, it’s the bandwidth (the βn term) that dominates, and latency tolerance doesn’t
help with this. Rearrangement of the data and computation to avoid some of the communication
altogether is required to reduce the bandwidth component of communication time.
There are quite a few collective communication operations provided by MPI, all of them useful
and important. We will use several in the assignment. To mention a few, MPI_Bcast broadcasts
a vector from one process to the rest of its process group; MPI_Scatter sends different data from
one process to each process in its group; MPI_Gather is the inverse of a scatter: one process
receives and concatenates data from all processes in its group; MPI_Allgather is like a gather
followed by a broadcast: all processes receive the concatenation of data that are initially distributed
among them; MPI_Reduce_scatter is like reduce followed by scatter: the result of the reduction
ends up distributed among the process group. Finally, MPI_Alltoall implements a very general
communication in which each process has a separate message to send to each member of the process
group.
Often the process group in a collective communication is some subset of all the processors. In
a typical situation, we may view the processes as forming a grid, let’s say a 2d grid, for example.
We may want to do a reduction operation within rows of the process grid. For this purpose, we
can use MPI_Reduce, with a separate communicator for each row.
To make this work, each process first computes its coordinates in the process grid. MPI makes
this easy. Next, one creates new communicators, one for each process row and one for each process
column; when they are no longer needed, they are released with the calls
MPI_Comm_free(&my_prow);
MPI_Comm_free(&my_pcol);
A reduction within a row communicator (MPI_Reduce with the MPI_SUM operation over my_prow)
leaves the sum of the vectors x in the vector sum on the process whose rank in the group is
zero: this will be the first process in the row. The communicator splitting has the general form
MPI_Comm_split(comm, color, key, &newcomm).
As in the example above, the group associated with comm is split into disjoint subgroups, one
for every different value of color; the communicator for the subgroup that this process belongs to
is returned in newcomm. The argument key determines the rank of this process within newcomm; the
members are ranked according to their value of key, with ties broken using the rank in comm.
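Putting the pieces together, a sketch of the row-reduction pattern just described (the names my_prow and my_pcol come from the text; the grid shape, the vector length, and the routine itself are assumptions for illustration):

#define PROWS 4     /* assumed process grid: PROWS x PCOLS processes */
#define PCOLS 4
#define N     100   /* assumed length of the vectors being summed    */

void row_sum(double x[N], double sum[N])   /* assumes MPI_Init has been called */
{
    int rank, myrow, mycol;
    MPI_Comm my_prow, my_pcol;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    myrow = rank / PCOLS;                  /* coordinates in the process grid */
    mycol = rank % PCOLS;

    /* color picks the subgroup, key orders the ranks inside it */
    MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &my_prow);   /* one communicator per row    */
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &my_pcol);   /* one communicator per column */

    /* sum the vectors x across a row; the result lands on the process whose
       rank in the row group is zero, i.e. the first process in the row */
    MPI_Reduce(x, sum, N, MPI_DOUBLE, MPI_SUM, 0, my_prow);

    MPI_Comm_free(&my_prow);
    MPI_Comm_free(&my_pcol);
}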
• Synchrony:
Synchronous: sending function returns when a matching receiving operation has been initiated
at the destination.
Blocking: sending function returns when the message has been safely copied.
Non-blocking (asynchronous): sending function returns immediately. Message buffer must
not be changed until it is safe to do so.
• Miscellaneous:
Interrupts: if enabled, the arrival of a message interrupts the receiving processor
Polling: the act of checking the occurrence of an event
Handlers: code written to execute when an interrupt occurs
Critical sections: sections of the code which cannot be interrupted safely
Scheduling: the process of determining which code to run at a given time
Priority: a relative statement about the ranking of an object according to some metric
• Software (libraries):
CrOS, CUBIX, NX, Express
Vertex
EUI/MPL (adjusts to the machine operating system)
CMMD
• Systems:
Mach
Linda (object-based system)
• Nodes:
CPU, local memory, perhaps local I/O
• Networks:
Topology: Hypercube, Mesh, Fat-Tree, other
Routing: circuit, packet, wormhole, virtual channel random
Bisection bandwidth (how much data can be sent along the net)
Reliable delivery
Flow Control
“State” : Space sharing, timesharing, stateless
OpenMP (http://www.openmp.org/) stands for Open specifications for Multi Processing. It specifies
a set of compiler directives, library routines, and environment variables as an API for writing
multi-threaded applications in C/C++ and Fortran. With these directives or pragmas, multi-threaded
code is generated by compilers. OpenMP is a shared memory model. Threads communicate by
sharing variables. The cost of communication comes from the synchronization of data sharing,
needed to protect against data conflicts.
Compiler pragmas in OpenMP take the following forms:
• for C/C++, #pragma omp construct [clause [clause] ...]
• for Fortran, directives such as !$OMP construct [clause [clause] ...]
For example, OpenMP is often used to parallelize loops; a simple parallel C program with its
loop split up looks like
void main()
{
double Data[10000];
#pragma omp parallel for
for (int i=0; i<10000; i++) {
task(Data[i]);
}
}
If all the OpenMP constructs in a program are compiler pragmas, then this program can be compiled
by compilers that do not support OpenMP.
OpenMP’s constructs fall into 5 categories, which are briefly introduced as follows:
1. Parallel Regions
Threads are created with the “omp parallel” pragma. In the following example, an 8-thread
parallel region is created:
double Data[10000];
omp_set_num_threads(8);
#pragma omp parallel
{
int ID = omp_get_thread_num();
task(ID,Data);
}
2. Work Sharing
The “omp for” construct splits the iterations of the loop that follows it among the threads of
the parallel region. For example:
void main()
{
double Data[10000];
#pragma omp parallel
#pragma omp for
for (int i=0; i<10000; i++) {
task(Data[i]);
}
}
“Sections” is another work-sharing construct which assigns different jobs (different pieces of
code) to each thread in a parallel region. For example,
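A small sketch of what such an example might look like (the three function names are placeholders):

#pragma omp parallel sections
{
    #pragma omp section
    x_calculation();    /* one thread executes this piece of code ...   */
    #pragma omp section
    y_calculation();    /* ... while another thread executes this one   */
    #pragma omp section
    z_calculation();    /* ... and a third thread executes this one     */
}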
3. Data Environment
In the shared-memory programming model, global variables are shared among threads, which
are file scope and static variables for C, and common blocks, save and module variables for
Fortran. Automatic variables within a statement block are private, as are stack variables in
sub-programs called from parallel regions. Constructs and clauses are available to selectively
change storage attributes.
• The “shared” clause uses a shared memory model for the variable, that is, all threads
share the same variable.
• The “private” clause gives a local copy of the variable in each thread.
• “firstprivate” is like “private”, but the variable is initialized with its value before the
thread creation.
• “lastprivate” is like “private”, but the value is passed to a global variable after the thread
execution.
4. Synchronization
The following constructs are used to support synchronization:
• “critical” and “end critical” constructs define a critical region, where only one thread
can enter at a time.
• “atomic” construct defines a critical region that only contains one simple statement.
• “barrier” construct is usually implicit, for example at the end of a parallel region or at
the end of a “for” work-sharing construct. The barrier construct makes threads wait
until all of them arrive.
• “ordered” construct enforces the sequential order for a block of code.
• “master” construct marks a block of code to be only executed by the master thread.
The other threads just skip it.
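For example, a sketch of the “critical” construct guarding an update to a shared variable (local_search stands in for whatever per-thread work produces a candidate value):

double best = 1.0e300;            /* shared by all threads */
#pragma omp parallel
{
    double mine = local_search(); /* private to each thread */
    #pragma omp critical
    {
        if (mine < best) best = mine;   /* only one thread at a time may update best */
    }
}   /* implicit barrier at the end of the parallel region */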
2.5 STARP
Star-P is a set of extensions to Matlab aimed at making it simple to parallelize many common
computations. A running instance of Star-P consists of a control Matlab process (connected to the
Matlab notebook/desktop the user sees) and a slave Matlab process for each processor available
to the system. It permits a matlab user to declare distributed matrix objects, whose contents are
distributed in various ways across the processors available to Star-P, and to direct each processor
to apply a matlab function to the portion of a distributed matrix it stores. It also supplies parallel
versions of many of the dense (and some of the sparse) matrix operations which occur in scientific
computing, such as eigenvector/eigenvalue computations.
This makes Star-P especially convenient for embarrassingly parallel problems, as all one needs
to do is:
1. Write a matlab function which carries out a part of the embarrassingly parallel problem,
parametrized such that it can take the inputs it needs as the rows or columns of a matrix.
2. Construct a distributed matrix containing the parameters needed for each slave process.
3. Tell each slave to apply the function to the portion of the matrix it has.
4. Combine the results from each process.
Now, some details. Star-P defines the variable np to store the number of slave processes it
controls. On a machine with two apparent processors, for example, np would equal 2. Distributed
matrices are declared in Star-P by appending *p to the dimension along which you want your
matrix to be distributed. For example, the code below declares A to be a row-distributed matrix;
each processor theoretically gets an equal chunk of rows from the matrix.
A = ones(100*p, 100);
To declare a column-distributed matrix, one simply does:
A = ones(100, 100*p);
Beware: if the number of processors you’re working with does not evenly divide the size of the
dimension you are distributing along, you may encounter unexpected results. Star-P also supports
matrices which are distributed into blocks, by appending *p to both dimensions; we will not go
into the details of this here.
After declaring A as a distributed matrix, simply evaluating it will yield something like:
This is because the elements of A are stored in the slave processors. To bring the contents to the
control process, simply index A. For example, if you wanted to view the entire contents, you’d do
A(:,:).
To apply a Matlab function to the chunks of a distributed matrix, use the mm command. It
takes a string containing a procedure name and a list of arguments, each of which may or may not
be distributed. It orders each slave to apply the function to the chunk of each distributed argument
(echoing non-distributed arguments) and returns a matrix containing the results of each appended
together. For example, mm(’fft’, A) (with A defined as a 100-by-100 column distributed matrix)
would apply the fft function to each of the 25-column chunks. Beware: the chunks must each be
distributed in the same way, and the function must return chunks of the same size. Also beware:
mm is meant to apply a function to chunks. If you want to compute the two-dimensional fft of A
in parallel, do not use mm(’fft2’, A); that will compute (in serial) the fft2s of each chunk of
A and return them in a matrix. eig(A), on the other hand, will apply the parallel algorithm for
eigenstuff to A.
Communication between processors must be mediated by the control process. This incurs
substantial communications overhead, as the information must first be moved to the control process,
processed, then sent back. It also necessitates the use of some unusual programming idioms; one
common pattern is to break up a parallel computation into steps, call each step using mm, then do
matrix column or row swapping in the control process on the distributed matrix to move information
between the processors. For example, given a matrix B = randn(10, 2*p) (on a Star-P process
with two slaves), the command B = B(:,[2,1]) will swap elements between the two processors.
Some other things to be aware of:
1. Star-P provides its functionality by overloading variables and functions from Matlab. This
means that if you overwrite certain variable names (or define your own versions of certain
functions), they will shadow the parallel versions. In particular, DO NOT declare a variable
named p; if you do, instead of distributing matrices when *p is appended, you will multiply
each element by your variable p.
2. persistent variables are often useful for keeping state across stepped computations. The
first time the function is called, each persistent variable will be bound to the empty matrix
[]; a simple if can test this and initialize it the first time around. Its value will then be
stored across multiple invocations of the function. If you use this technique, make sure to
clear those variables (or restart Star-P) to ensure that state isn’t propagated across runs.
4. If mm doesn’t appear to find your m-files, run the mmpath command (which takes one argument
- the directory you want mm to search in).
Have fun!
Lecture 3
Parallel Prefix
4. Return [bi ].
An example using the vector [1, 2, 3, 4, 5, 6, 7, 8] is shown in Figure 3.1. Going up the tree, we
simply compute the pairwise sums. Going down the tree, we use the updates according to points
2 and 3 above. For even positions, we use the value of the parent node (bi). For odd positions, we
add the value of the node left of the parent node (bi−1) to the current value (ai).
We can create variants of the algorithm by modifying the update formulas 2 and 3. For example,
the excluded prefix sum
Figure 3.2 illustrates this algorithm using the same input vector as before.
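A compact serial sketch of the same up-sweep/down-sweep computation (our own illustration; it assumes the length n is a power of two and specializes ⊕ to addition):

/* In-place inclusive prefix sum.  The first loop is the "going up the tree"
   pass (pairwise sums); the second is the "going down the tree" pass that
   fills in the remaining positions.  Every iteration of an inner loop is
   independent of the others, which is where the parallelism comes from. */
void scan(double *a, int n)
{
    int stride, i;
    for (stride = 1; stride < n; stride *= 2)          /* up-sweep   */
        for (i = 2*stride - 1; i < n; i += 2*stride)
            a[i] += a[i - stride];
    for (stride = n/4; stride >= 1; stride /= 2)       /* down-sweep */
        for (i = 3*stride - 1; i < n; i += 2*stride)
            a[i] += a[i - stride];
}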
The total number of ⊕ operations performed by the Parallel Prefix algorithm is (ignoring a
constant term of ±1):
\[
T_n \;=\; \overbrace{\tfrac{n}{2}}^{\text{I}} \;+\; \overbrace{T_{n/2}}^{\text{II}} \;+\; \overbrace{\tfrac{n}{2}}^{\text{III}}
\;=\; n + T_{n/2} \;=\; 2n
\]
If there is a processor for each array element, then the number of parallel operations is:
\[
T_n \;=\; \overbrace{1}^{\text{I}} \;+\; \overbrace{T_{n/2}}^{\text{II}} \;+\; \overbrace{1}^{\text{III}}
\;=\; 2 + T_{n/2} \;=\; 2 \lg n
\]
1. At each processor i, compute a local scan serially, for n/p consecutive elements, giving result
[di1, di2, . . . , dik]. Notice that this step vectorizes over processors.

2. Use the parallel prefix algorithm on the p per-processor totals [d1k, d2k, . . . , dpk], giving the
prefix totals [b1, b2, . . . , bp]; this is the step that requires the lg p message passes.

3. At each processor i > 1, add the offset bi−1 to every element of its local result.
In the limiting case of p ≪ n, the lg p message passes are an insignificant portion of the
computational time, and the speedup is due solely to the availability of a number of processes each
doing the prefix operation serially.
A = [1 2 3 4 5 6 7 8 9 10]
C = [1 0 0 0 1 0 1 1 0 1]
plus scan(A, C) = [1 3 6 10 5 11 7 8 17 10]
We now show how to reduce segmented scan to simple scan. We define an operator, ⊕₂, whose
operand is a pair
\[
\begin{pmatrix} x \\ y \end{pmatrix} .
\]
We denote this operand as an element of the 2-element representation of A and C, where x and y
are corresponding elements from the vectors A and C. The operands of the example above are
given as:
\[
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
\begin{pmatrix} 2 \\ 0 \end{pmatrix}
\begin{pmatrix} 3 \\ 0 \end{pmatrix}
\begin{pmatrix} 4 \\ 0 \end{pmatrix}
\begin{pmatrix} 5 \\ 1 \end{pmatrix}
\begin{pmatrix} 6 \\ 0 \end{pmatrix}
\begin{pmatrix} 7 \\ 1 \end{pmatrix}
\begin{pmatrix} 8 \\ 1 \end{pmatrix}
\begin{pmatrix} 9 \\ 0 \end{pmatrix}
\begin{pmatrix} 10 \\ 1 \end{pmatrix}
\]
The operator ⊕₂ is defined as follows:
\[
\begin{array}{c|cc}
\oplus_2 &
\begin{pmatrix} y \\ 0 \end{pmatrix} &
\begin{pmatrix} y \\ 1 \end{pmatrix} \\[6pt] \hline
\begin{pmatrix} x \\ 0 \end{pmatrix} &
\begin{pmatrix} x \oplus y \\ 0 \end{pmatrix} &
\begin{pmatrix} y \\ 1 \end{pmatrix} \\[6pt]
\begin{pmatrix} x \\ 1 \end{pmatrix} &
\begin{pmatrix} x \oplus y \\ 1 \end{pmatrix} &
\begin{pmatrix} y \\ 1 \end{pmatrix}
\end{array}
\]
As an exercise, we can show that the binary operator ⊕₂ defined above is associative and
exhibits the segmenting behavior we want: for each vector A and each boolean vector C, let AC
be the 2-element representation of A and C. For each binary associative operator ⊕, the result
of ⊕₂ scan(AC) gives a 2-element vector whose first row is equal to the vector computed by
segmented ⊕ scan(A, C). Therefore, we can apply the parallel scan algorithm to compute the
segmented scan.
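In code, the operator on pairs looks like the following small sketch (with ⊕ specialized to addition; the type and function names are ours):

typedef struct { double x; int flag; } Pair;   /* one element of the 2-element representation */

/* The operator from the table above: if the right operand starts a new segment
   its value wins; otherwise the values combine and the flags are OR-ed. */
Pair op2(Pair left, Pair right)
{
    Pair r;
    if (right.flag) {
        r.x = right.x;
        r.flag = 1;
    } else {
        r.x = left.x + right.x;
        r.flag = left.flag | right.flag;
    }
    return r;
}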
Notice that the method of assigning each segment to a separate processor may result in load
imbalance.
The ci in this equation can be calculated by Leverier’s lemma, which relates the ci to sk = tr(A^k).
The Csanky algorithm, then, is to calculate the A^i by parallel prefix, compute the trace of each A^i,
calculate the ci from Leverier’s lemma, and use these to generate A−1.
Figure 3.3: Babbage’s Difference Engine, reconstructed by the Science Museum of London
While the Csanky algorithm is useful in theory, it suffers a number of practical shortcomings.
The most glaring problem is the repeated multiplication of the A matrix. Unless the coefficients
of A are very close to 1, the terms of A^n are likely to increase towards infinity or decay to zero
quite rapidly, making their storage as floating point values very difficult. Therefore, the algorithm
is inherently unstable.
Charles Babbage is considered by many to be the founder of modern computing. In the 1820s he
pioneered the idea of mechanical computing with his design of a “Difference Engine,” the purpose
of which was to create highly accurate engineering tables.
A central concern in mechanical addition procedures is the idea of “carrying,” for example, the
overflow caused by adding two digits in decimal notation whose sum is greater than or equal to
10. Carrying, as is taught to elementary school children everywhere, is inherently serial, as two
numbers are added left to right.
However, the carrying problem can be treated in a parallel fashion by use of parallel prefix.
More specifically, consider:
\[
\begin{array}{cccccl}
c_3 & c_2 & c_1 & c_0 &     & \text{Carry} \\
    & a_3 & a_2 & a_1 & a_0 & \text{First Integer} \\
+   & b_3 & b_2 & b_1 & b_0 & \text{Second Integer} \\
\hline
s_4 & s_3 & s_2 & s_1 & s_0 & \text{Sum}
\end{array}
\]
By algebraic manipulation, one can create a transformation matrix for computing ci from ci−1:
\[
\begin{pmatrix} c_i \\ 1 \end{pmatrix}
=
\begin{pmatrix} a_i + b_i & a_i b_i \\ 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} c_{i-1} \\ 1 \end{pmatrix}
\]
Thus, carry look-ahead can be performed by parallel prefix. Each ci is computed by parallel
prefix, and then the si are calculated in parallel.
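A serial sketch of the same computation in the usual generate/propagate form (a Boolean restatement of the matrix above; the names are ours). The combining rule is associative, so the loop below can be replaced by a parallel prefix over the (g, p) pairs:

typedef struct { unsigned g, p; } GP;   /* generate and propagate bits for one position */

/* Associative rule for composing two adjacent blocks of bit positions. */
GP combine(GP left, GP right)
{
    GP r;
    r.g = right.g | (right.p & left.g);
    r.p = right.p & left.p;
    return r;
}

/* c[i] = carry out of bit position i when adding a and b, with no carry in. */
void carries(const unsigned *a, const unsigned *b, unsigned *c, int n)
{
    GP acc = { 0, 0 };
    for (int i = 0; i < n; i++) {
        GP here = { a[i] & b[i], a[i] | b[i] };  /* g_i = a_i b_i, p_i = a_i + b_i */
        acc = combine(acc, here);                /* running prefix of the operator */
        c[i] = acc.g;
    }
}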
[MPI_Scan] is much like MPI_Allreduce in that the values are formed by combining
values contributed by each process and that each process receives a result. The difference
is that the result returned by the process with rank r is the result of operating on the
input elements on processes with rank 0, 1, . . . , r.
Essentially, MPI_Scan operates locally on a vector and passes a result to each processor. If
the defined operation of MPI_Scan is MPI_SUM, the result passed to each process is the partial sum
including the numbers on the current process.
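For example, a minimal fragment (it assumes myrank has already been obtained from MPI_Comm_rank, as in the earlier examples):

int myval = myrank + 1;    /* process r contributes the value r + 1 */
int partial;

MPI_Scan(&myval, &partial, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* partial now holds 1 + 2 + ... + (myrank + 1) on each process */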
MPI_Scan, upon further investigation, is not a true parallel prefix algorithm. It appears that
the partial sum from each process is passed to the next process in a serial manner. That is, the message
passing portion of MPI_Scan does not scale as lg p, but rather as simply p. However, as discussed in
Section 3.2, the message passing time cost is so small in large systems that it can be neglected.
Lecture 4
Definition. (Wilkinson) A sparse matrix is a matrix with enough zeros that it is worth taking
advantage of them.
Definition. A structured matrix has enough structure that it is worthwhile to use it.
For example, a Toeplitz Matrix is defined by 2n parameters. All entries on a diagonal are the
same:
\[
\text{Toeplitz matrix} =
\begin{pmatrix}
1 & 4 &        &        &        \\
2 & 1 & 4      &        &        \\
  & 2 & 1      & 4      &        \\
  &   & \ddots & \ddots & \ddots \\
  &   &        & 2      & 1
\end{pmatrix}
\]
purposes of our discussion in this chapter, this does not seem to help for dense matrices. This does
not mean that any parallel dense linear algebra algorithm should be conceived or even coded in a
serial manner. What it does mean, however, is that we look particularly hard at what the matrix
A represents in the practical application for which we are trying to solve the equation Ax = b. By
examining the matrix carefully, we might indeed recognize some other less-obvious ’structure’ that
we might be able to exploit.
4.2 Applications
There are not many applications for large dense linear algebra routines, perhaps due to the “law
of nature” below.
• “Law of Nature”: Nature does not throw n^2 numbers at us haphazardly, therefore there are
few dense matrix problems.
Some believe that there are no real problems that will turn up n^2 numbers to populate the
n × n matrix without exhibiting some form of underlying structure. This implies that we should
seek methods to identify the structure underlying the matrix. This becomes particularly important
when the size of the system becomes large.
What does it mean to ‘seek methods to identify the structure’? Plainly speaking, that answer
is not known, not just because it is inherently difficult but also because prospective users of dense
linear algebra algorithms (as opposed to developers of such algorithms) have not started to identify
the structure of their A matrices. Sometimes identifying the structure might involve looking beyond
traditional literature in the field.
communication to collect and form our answer. If we did not pay attention to the condition number
of A and correspondingly the condition number of chunks of A that reside on different processors,
our numerical accuracy for the parallel computing task would suffer.
This was just one example of how even in a seemingly unstructured case, insights from another
field, random matrix theory in this case, could potentially alter our impact or choice of algorithm
design. Incidentally, even what we just described above has not been incorporated into any parallel
applications in radar processing that we are aware of. Generally speaking, the design of efficient
parallel dense linear algebra algorithms will have to be motivated by and modified based on specific
applications with an emphasis on uncovering the structure even in seemingly unstructured problems.
This, by definition, is something that only users of algorithms could do. Until then, an equally
important task is to make dense linear algebra algorithms and libraries that run efficiently regardless
of the underlying structure while we wait for the applications to develop.
While there are not too many everyday applications that require dense linear algebra solutions,
it would be wrong to conclude that the world does not need large linear algebra libraries. Medium
sized problems are most easily solved with these libraries, and the first pass at larger problems are
best done with the libraries. Dense methods are the easiest to use, reliable, predictable, easiest to
write, and work best for small to medium problems.
For large problems, it is not clear whether dense methods are best, but other approaches often
require far more work.
4.3 Records
Table 4.1 shows the largest dense matrices solved. Problems that warrant solving such huge
systems are typically things like the Stealth bomber and large Boundary Element codes.1
Another application for large dense problems arises in the “method of moments”, electromagnetic
calculations used by the military.
1. Typically this method involves a transformation using Green’s Theorem from 3D to a dense 2D representation
of the problem. This is where the large data sets are generated.
It is important to understand that space considerations, not processor speeds, are what bound
the ability to tackle such large systems. Memory is the bottleneck in solving these large dense
systems. Only a tiny portion of the matrix can be stored inside the computer at any one time. It
is also instructive to look at how technological advances change some of these considerations.
For example, in 1996, the record setter of size n = 128,600 required (2/3)n^3 = 1.4 × 10^15
arithmetic operations (or four times that many if it is a complex matrix) for its solution using
Gaussian elimination. On a fast uniprocessor workstation in 1996 running at 140 MFlops/sec,
that would take ten million seconds, about 16 and a half weeks; but on a large parallel machine,
running at 1000 times this speed, the time to solve it is only 2.7 hours. The storage requirement
was 8n^2 = 1.3 × 10^13 bytes, however. Can we afford this much main memory? Again, we need to
look at it in historical perspective.
In 1996, when the price of memory was as low as $10 per megabyte, it would have cost $130 million to buy enough memory for the matrix. Today, however, the price of memory is much lower: at 5 cents per megabyte, the memory for the same system would cost $650,000. The cost is still prohibitive, but much more realistic.
In contrast, the Earth Simulator, which can solve a dense linear system with n = 1,041,216, would require (2/3)n³ = 7.5 × 10¹⁷ arithmetic operations (or four times that many if the matrix is complex) for its solution using Gaussian elimination. For a 2.25 GHz Pentium 4 uniprocessor workstation available today, running at 3 GFlops/sec, this would take 250 million seconds, roughly 414 weeks or about 8 years! On the Earth Simulator, running at its maximum of 35.86 TFlops/sec, about 10,000 times the speed of a desktop machine, it would take only about 5.8 hours! The storage requirement for this problem would be 8n² = 8.7 × 10¹⁴ bytes, which at 5 cents a megabyte works out to about $43.5 million. This is still prohibitive, although the figurative “bang for the buck” keeps getting better.
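These back-of-the-envelope estimates are easy to reproduce. The short sketch below recomputes the operation counts and solve times; the flop rates are the ones quoted above, and the helper function is our own illustration, not part of any library.

def dense_solve_time(n, flops_per_sec):
    # (2/3) n^3 real arithmetic operations for Gaussian elimination
    ops = (2.0 / 3.0) * n**3
    return ops, ops / flops_per_sec

cases = [("1996 workstation, 140 MFlop/s", 128_600, 140e6),
         ("1996 parallel machine, 1000x faster", 128_600, 140e9),
         ("Earth Simulator, 35.86 TFlop/s", 1_041_216, 35.86e12)]
for label, n, speed in cases:
    ops, sec = dense_solve_time(n, speed)
    print(f"{label}: {ops:.1e} ops, {sec/3600.0:.1f} hours ({sec/604800.0:.2f} weeks)")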
Even in 1996, the actual cost of the storage was not as high as we calculated: most parallel computers then were specially designed supercomputers, and “out of core” methods were used to hold the massive amount of data. In 2004, however, with the emergence of clusters as a viable and powerful supercomputing option, network storage capability and management become an equally important factor that adds to the cost and complexity of the parallel computer.
In general, however, Moore's law does seem to be helpful, because the cost per gigabyte, especially for systems with large storage capacity, keeps getting lower. Concurrently, the density of these storage media keeps increasing, so the amount of physical space needed to house them becomes smaller. As a result, we can expect that as storage systems become cheaper and denser, it will become increasingly practical to design and maintain parallel computers. The accompanying figures show some of these trends in storage density and price.
• communication is instantaneous
This assumption is made frequently in theory classes, but it does not hold in practice. Communication cost is critical, and no one can afford n² processors when n = 128,000.
In practical parallel matrix computation, it is essential to have large chunks of each matrix
on each processor. There are several reasons for this. The first is simply that there are far more
matrix elements than processors! Second, it is important to achieve message vectorization. The
communications that occur should be organized into a small number of large messages, because of
the high message overhead. Lastly, uniprocessor performance is heavily dependent on the nature
of the local computation done on each processor.
To match the bandwidth of the fast processor to that of the slow memory, architects employ several added layers of memory hierarchy. The processor has registers that are as fast as the processing units. They are connected to an on-chip cache that is nearly as fast but small (a few tens of kilobytes). This is connected to an off-chip level-two cache made from fast but expensive static random access memory (SRAM) chips. Finally, main memory is built from the lowest cost-per-bit technology, dynamic RAM (DRAM). A similar caching structure supports instruction accesses.
When LINPACK was designed (in the mid 1970s) these considerations were just over the horizon. Its designers used what was then an accepted model of cost: the number of arithmetic operations. Today, a more relevant metric is the number of references to memory that miss the cache and cause a cache line to be moved from main memory to a higher level of the hierarchy. To write portable software that performs well under this metric is, unfortunately, a much more complex task. In fact, one cannot predict how many cache misses a code will incur simply by examining the code; one cannot even predict it by examining the machine code that the compiler generates. The behavior of real memory systems is quite complex. But, as we shall now show, the programmer can still write quite acceptable code.
(We have a bit of a paradox in that this issue does not really arise on Cray vector computers. These computers have no cache. They have no DRAM, either! The whole main memory is built of SRAM, which is expensive but fast enough to support the full speed of the processor. The high-bandwidth memory technology raises the machine cost dramatically and makes the programmer's job a lot simpler. When one considers the enormous cost of software, this has seemed like a reasonable tradeoff.
Why then aren't parallel machines built out of Cray's fast technology? The answer seems to be that the microprocessors used in workstations and PCs have become as fast as the vector processors. Their usual applications do pretty well with cache in the memory hierarchy, without reprogramming. Enormous investments are made in this technology, which has improved at a remarkable rate. And so, because these technologies appeal to a mass market, they have simply priced the expensive vector machines out of a large part of their market niche.)
Creators of the LAPACK software library for dense linear algebra accepted the design challenge of enabling developers to write portable software that minimizes costly cache misses on the memory hierarchy of any hardware platform.
The LAPACK designers' strategy was to have manufacturers write fast BLAS, especially for the Level 3 BLAS; LAPACK codes then call the BLAS, and so LAPACK gets high performance. In reality, two things go wrong: manufacturers don't make much of an investment in their BLAS, and LAPACK does other things besides call the BLAS, so Amdahl's law applies.
interaction between algorithm and architecture can expose optimization possibilities. Figure 4.7 shows a graphical depiction of The Fundamental Triangle.
Some scalar a(i, j) algorithms may be expressed as square submatrix A(I : I + NB − 1, J : J + NB − 1) algorithms. Also, dense matrix factorization is a BLAS Level 3 computation consisting of a series of submatrix computations. Each submatrix computation is BLAS Level 3, and each matrix operand in Level 3 is used multiple times. A BLAS Level 3 computation performs O(n³) operations on O(n²) data. Therefore, in order to minimize the expense of moving data in and out of cache, the goal is to perform O(n) operations per data movement and to amortize the expense over the largest possible number of operations. The nature of dense linear algebra algorithms provides the opportunity to do just that, through the potential closeness of data within submatrices and the frequent reuse of that data.
Architecture impact
The floating point arithmetic required for dense linear algebra computation is done in the L1 cache.
Operands must be located in the L1 cache in order for multiple reuse of the data to yield peak
performance. Moreover, operand data must map well into the L1 cache if reuse is to be possible.
Operand data is represented using Fortran/C 2-D arrays. Unfortunately, the matrices that these
2-D arrays represent, and their submatrices, do not map well into L1 cache. Since memory is one
dimensional, only one dimension of these arrays can be contiguous. For Fortran, the columns are
contiguous, and for C the rows are contiguous.
To deal with this issue, this theory proposes that algorithms should be modified to map the
input data from the native 2-D array representation to contiguous submatrices that can fit into the
L1 cache.
Cholesky example
By utilizing the Recursive Block Format and by adopting a recursive strategy for dense linear
algorithms, concise algorithms emerge. Figure 4.10 shows one node in the recursion tree of a
recursive Cholesky algorithm. At this node, Cholesky is applied to a matrix of size n. Note that
n need not be the size of the original matrix, as this figure describes a node that could appear
anywhere in the recursion tree, not just the root.
The lower triangular matrix below the Cholesky node describes the input matrix in terms of its recursive blocks, A11, A21, and A22.
• When C(n1), the recursive Cholesky factorization of A11, has returned, L11 has been computed and it replaces A11.
• C(n2), the Cholesky factorization of the updated A22, is now computed recursively, and L22 is returned.
The BLAS operations (i.e., DTRSM and DSYRK) can be implemented using matrix multiply, and the operands to these operations are submatrices of A. This pattern generalizes to other dense linear algebra computations (e.g., general matrix factorization, QR factorization). Every dense linear algebra algorithm calls the BLAS several times, and every one of these BLAS calls has all of its matrix operands equal to submatrices of the matrices A, B, ... of the dense linear algebra algorithm. This pattern can be exploited to improve performance through the use of the Recursive Data Format.
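A minimal NumPy/SciPy sketch of this recursion is shown below; the function name rchol and the leaf block size nb are our own choices, not part of any library. The triangular solve plays the role of DTRSM and the symmetric update the role of DSYRK.

import numpy as np
from scipy.linalg import solve_triangular

def rchol(A, nb=64):
    """Recursive Cholesky: A = L L^T with A symmetric positive definite."""
    n = A.shape[0]
    if n <= nb:                                   # leaf: call a dense kernel
        return np.linalg.cholesky(A)
    n1 = n // 2
    A11, A21, A22 = A[:n1, :n1], A[n1:, :n1], A[n1:, n1:]
    L11 = rchol(A11, nb)                                  # C(n1)
    L21 = solve_triangular(L11, A21.T, lower=True).T      # DTRSM: L21 L11^T = A21
    L22 = rchol(A22 - L21 @ L21.T, nb)                    # DSYRK update, then C(n2)
    L = np.zeros_like(A)
    L[:n1, :n1], L[n1:, :n1], L[n1:, n1:] = L11, L21, L22
    return L

A = np.random.rand(200, 200); A = A @ A.T + 200 * np.eye(200)
L = rchol(A)
assert np.allclose(L @ L.T, A)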
the hardware can represent. The laws of science relate two- and three-dimensional objects; we live in a three-dimensional world, but computer storage is one dimensional. Moreover, mathematicians have proved that it is not possible to maintain closeness between points in a neighborhood unless the two objects have the same dimension. Despite this negative theorem, and the limitations it implies on the relationship between data and available computer storage hardware, recursion provides a good approximation. Figure 4.11 shows this graphically via David Hilbert's space-filling curve.
So, should we convert the data from consecutive to cyclic order, and back from cyclic to consecutive when we are done? The answer is “no”; the better approach is to reorganize the algorithm rather than the data. The idea behind this approach is to regard matrix indices as a set (not necessarily ordered) instead of an ordered sequence.
In general, if you have to rearrange the data, maybe you can rearrange the calculation instead.
Figure 4.13: A stage in Gaussian elimination using cyclic order, where the shaded portion refers to the zeros and the unshaded refers to the non-zero elements.
4.8.1 Problems
1. For performance analysis of the Gaussian elimination algorithm, one can ignore the operations
performed outside of the inner loop. Thus, the algorithm is equivalent to
      do k = 1, n
        do j = k, n
          do i = k, n
            a(i,j) = a(i,j) - a(i,k) * a(k,j)
          enddo
        enddo
      enddo
The “owner” of a(i, j) gets the task of the computation in the inner loop, for all 1 ≤ k ≤
min(i, j).
Analyze the load imbalance that occurs in one-dimensional block mapping of the columns of
the matrix: n = bp and processor r is given the contiguous set of columns (r − 1)b + 1, . . . , rb.
(Hint: Up to low order terms, the average load per processor is n³/(3p) inner loop tasks, but the most heavily loaded processor gets half again as much to do.)
Repeat this analysis for the two-dimensional block mapping. Does this imbalance affect the scalability of the algorithm? Or does it just make a difference in the efficiency by some constant factor, as in the one-dimensional case? If so, what factor?
Finally, do an analysis for the two-dimensional cyclic mapping. Assume that p = q², and that n = bq for some blocksize b. Does the cyclic method remove load imbalance completely?
Lecture 5
The solution of a linear system Ax = b is one of the most important computational problems in scientific computing. As we showed in the previous section, these linear systems are often derived from a set of differential equations, by either a finite difference or a finite element formulation over a discretized mesh.
The matrix A of a discretized problem is usually very sparse; that is, it has enough zero entries that they can be taken advantage of algorithmically. Sparse matrices can be divided into two classes: structured sparse matrices and unstructured sparse matrices. A structured matrix is usually generated from a structured, regular grid, and an unstructured matrix is usually generated from a non-uniform, unstructured grid. Therefore, sparse techniques are designed for structured sparse matrices in the simplest case and for unstructured matrices in the general case.
1. d₁ = b₁
2. e₁ = c₁/d₁
3. for i = 2 : n
The number of floating point operations is 3n up to an additive constant. With such a factorization, we can then solve the tridiagonal linear system in an additional 5n floating point operations. However, this method is very reminiscent of the naive sequential algorithm for the prefix sum, whose computation graph has a critical path of length O(n). Cyclic reduction, developed by Golub and Hockney [?], is very similar to the parallel prefix algorithm presented in Section ?? and it reduces the length of the dependency chain in the computational graph to the smallest possible.
The basic idea of cyclic reduction is to first eliminate the odd-numbered variables to obtain a tridiagonal linear system of ⌈n/2⌉ equations. Then we solve the smaller linear system recursively. Note that each variable appears in three equations. The elimination of the odd-numbered variables gives a tridiagonal system over the even-numbered variables as follows:
Recursively solving this smaller tridiagonal linear system, we obtain the values of x₂ᵢ for all i = 1, ..., n/2. We can then compute the values of x₂ᵢ₋₁ by a simple back-substitution equation:
By a simple calculation, we can show that the total number of floating point operations is equal to 16n up to an additive constant. So the total work is doubled compared with the sequential algorithm discussed above, but the length of the critical path is reduced to O(log n). It is worthwhile to point out that the total work of the parallel prefix sum algorithm is also double that of the sequential algorithm. Parallel computing is about the trade-off between parallel time and total work. The discussion shows that if we have n processors, then we can solve a tridiagonal linear system in O(log n) time.
When the number of processors p is much less than n, similarly to the prefix sum, we hybridize cyclic reduction with the sequential factorization algorithm. We can show that the number of parallel floating point operations is bounded by 16n(n + log n)/p and the number of rounds of communication is bounded by O(log p). The communication pattern is nearest neighbor.
Cyclic Reduction has been generalized to two dimensional finite difference systems where the
matrix is a block tridiagonal matrix.
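The sketch below (our own illustration, not the Golub-Hockney code) carries out this elimination recursively on a tridiagonal system stored as three coefficient arrays a, b, c (sub-, main, and super-diagonal) and a right-hand side d.

import numpy as np

def cyclic_reduction(a, b, c, d):
    """Solve  a[i] x[i-1] + b[i] x[i] + c[i] x[i+1] = d[i]  (a[0] = c[-1] = 0)
    by cyclic reduction: eliminate every other unknown, recurse on the
    half-size tridiagonal system, then back-substitute."""
    n = len(b)
    if n == 1:
        return np.array([d[0] / b[0]])
    keep = np.arange(1, n, 2)             # unknowns kept in the reduced system
    alpha = a[keep] / b[keep - 1]
    gamma = np.zeros_like(alpha)
    has_up = keep + 1 < n                 # last kept unknown may have no upper neighbour
    gamma[has_up] = c[keep[has_up]] / b[keep[has_up] + 1]

    b2 = b[keep] - alpha * c[keep - 1]
    d2 = d[keep] - alpha * d[keep - 1]
    a2 = -alpha * a[keep - 1]
    c2 = np.zeros_like(alpha)
    b2[has_up] -= gamma[has_up] * a[keep[has_up] + 1]
    d2[has_up] -= gamma[has_up] * d[keep[has_up] + 1]
    c2[has_up] = -gamma[has_up] * c[keep[has_up] + 1]
    a2[0] = 0.0                           # first reduced equation has no lower neighbour

    x = np.zeros(n)
    x[keep] = cyclic_reduction(a2, b2, c2, d2)   # recurse on the even-numbered unknowns
    xpad = np.concatenate(([0.0], x, [0.0]))
    elim = np.arange(0, n, 2)                    # back-substitute the eliminated unknowns
    x[elim] = (d[elim] - a[elim] * xpad[elim] - c[elim] * xpad[elim + 2]) / b[elim]
    return x

n = 37
a = np.r_[0.0, -np.ones(n - 1)]; c = np.r_[-np.ones(n - 1), 0.0]
b = 4.0 * np.ones(n); d = np.random.rand(n)
x = cyclic_reduction(a, b, c, d)
assert np.allclose(b * x + a * np.r_[0.0, x[:-1]] + c * np.r_[x[1:], 0.0], d)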
Figure: the graph G(A) of a sparse matrix, the fill edges created during elimination, and the corresponding elimination tree T(A).
Figure 5.3: spy plot of a randomly generated 100 × 100 symmetric sparse matrix (nz = 300).
The order of elimination determines both fill and elimination tree height. Unfortunately, but
inevitably, finding the best ordering is NP-complete. Heuristics are used to reduce fill-in. The
following lists some commonly used ones.
• nested dissection
• Cuthill-McKee ordering.
These ordering heuristics can be investigated in Matlab on various sparse matrices. The simplest
way to obtain a random sparse matrix is to use the command A=sprand(n,m,f), where n and m
denote the size of the matrix, and f is the fraction of nonzero elements. However, these matrices
are not based on any physical system, and hence may not illustrate the effectiveness of an ordering
scheme on a real world problem. An alternative is to use a database of sparse matrices, one of
which is available with the command/package ufget.
Once we have a sparse matrix, we can view its sparsity structure with the command spy(A). An example with a randomly generated symmetric sparse matrix is given in Figure 5.3.
We now carry out Cholesky factorization of A using no ordering and using SYMMMD. The sparsity structures of the resulting triangular matrices are given in Figure 5.4. As shown, using a heuristic-based ordering scheme results in significantly less fill-in. This effect is usually more pronounced when the matrix arises from a physical problem and hence has some associated structure.
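The same experiment can be carried out outside Matlab. The SciPy sketch below (the 2-D Laplacian test matrix is an illustrative choice) factors a sparse matrix with the natural ordering and with a fill-reducing column ordering, both exposed through the permc_spec option of scipy.sparse.linalg.splu, and compares the fill in the factors.

import scipy.sparse as sp
from scipy.sparse.linalg import splu

# 2-D Laplacian on a k-by-k grid: a structured sparse SPD matrix
k = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(k, k))
A = (sp.kron(sp.eye(k), T) + sp.kron(T, sp.eye(k))).tocsc()

# Factor with the natural ordering and with a fill-reducing column ordering
for order in ("NATURAL", "COLAMD"):
    lu = splu(A, permc_spec=order)
    fill = lu.L.nnz + lu.U.nnz
    print(f"{order:8s}: nnz(A) = {A.nnz}, nnz(L)+nnz(U) = {fill}")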
We now examine an ordering method called nested dissection, which uses vertex separators in
a divide-and-conquer node ordering for sparse Gaussian elimination. Nested dissection [37, 38, 59]
was originally a sequential algorithm, pivoting on a single element at a time, but it is an attractive
Figure 5.4: spy plots of the Cholesky factors of A with no ordering (nz = 411) and with SYMMMD ordering (nz = 255).
parallel ordering as well because it produces blocks of pivots that can be eliminated independently
in parallel [9, 29, 40, 61, 74].
Consider a regular finite difference grid. By dissecting the graph along the center lines (enclosed
in dotted curves), the graph is split into four independent graphs, each of which can be solved in
parallel.
The connections are included only at the end of the computation in an analogous way to domain
decomposition discussed in earlier lectures. Figure 5.6 shows how a single domain can be split up
into two roughly equal sized domains A and B which are independent and a smaller domain C
that contains the connectivity.
One can now recursively order A and B, before finally proceeding to C. More generally, begin
by recursively ordering at the leaf level and then continue up the tree. The question now arises as
to how much fill is generated in this process. A recursion formula for the fill F generated for such
C
A B
n √
H(n) = H( ) + 2 n (5.4)
2
√
H(n) = const × n (5.5)
Nested dissection can be generalized to three-dimensional regular grids or to other classes of graphs that have small separators. We will come back to this point in the section on graph partitioning.
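The sketch below (a toy illustration of our own; real packages use graph separators and much better base-case orderings) produces a nested dissection elimination order for a regular rows × cols grid by recursively ordering the two halves first and the separator last.

def nested_dissection(rows, cols):
    """Return an elimination order for the nodes of a rows-by-cols grid:
    recursively order the two halves first, then the separator line last."""
    if rows * cols <= 4:                      # small region: any order will do
        return [(i, j) for i in range(rows) for j in range(cols)]
    order = []
    if rows >= cols:                          # split along the longer dimension
        mid = rows // 2
        top = nested_dissection(mid, cols)
        bot = nested_dissection(rows - mid - 1, cols)
        order += top                                         # subdomain A
        order += [(i + mid + 1, j) for i, j in bot]          # subdomain B
        order += [(mid, j) for j in range(cols)]             # separator C, ordered last
    else:
        mid = cols // 2
        left = nested_dissection(rows, mid)
        right = nested_dissection(rows, cols - mid - 1)
        order += left
        order += [(i, j + mid + 1) for i, j in right]
        order += [(i, mid) for i in range(rows)]
    return order

perm = nested_dissection(8, 8)
assert sorted(perm) == [(i, j) for i in range(8) for j in range(8)]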
elimination tree. This leads to the multifrontal algorithm. The sequential version of the algorithm
is given below. For every column j there is a block Ūj (which is equivalent to V D −1 V T ).
$$\bar U_j = -\sum_k \begin{pmatrix} l_{jk} \\ l_{i_1 k} \\ \vdots \\ l_{i_r k} \end{pmatrix} \begin{pmatrix} l_{jk} & l_{i_1 k} & \cdots & l_{i_r k} \end{pmatrix},$$

where the sum is taken over all descendants k of j in the elimination tree, and j, i₁, i₂, . . . , iᵣ are the indices of the non-zeros in column j of the Cholesky factor.
For j = 1, 2, . . . , n, let j, i₁, i₂, . . . , iᵣ be the indices of the non-zeros in column j of L, and let c₁, . . . , cₛ be the children of j in the elimination tree. Let Ū = U_{c₁} ⊕ · · · ⊕ U_{cₛ}, where the Uᵢ's were defined in a previous step of the algorithm and ⊕ is the extend-add operator, which is best explained by example. Let

$$R = \begin{array}{c|cc} & 5 & 8 \\ \hline 5 & p & q \\ 8 & u & v \end{array}, \qquad S = \begin{array}{c|cc} & 5 & 9 \\ \hline 5 & w & x \\ 9 & y & z \end{array}$$

(the rows and columns of R correspond to rows 5 and 8 of the original matrix, etc.). Then

$$R \oplus S = \begin{array}{c|ccc} & 5 & 8 & 9 \\ \hline 5 & p+w & q & x \\ 8 & u & v & 0 \\ 9 & y & 0 & z \end{array}$$
Define

$$F_j = \begin{pmatrix} a_{jj} & \cdots & a_{j i_r} \\ \vdots & & \vdots \\ a_{i_r j} & \cdots & a_{i_r i_r} \end{pmatrix} \oplus \bar U$$

(this corresponds to C − V D⁻¹Vᵀ). Now factor F_j:

$$F_j = \begin{pmatrix} l_{jj} & 0 \;\cdots\; 0 \\ l_{i_1 j} & \\ \vdots & I \\ l_{i_r j} & \end{pmatrix} \begin{pmatrix} 1 & 0 \;\cdots\; 0 \\ 0 & \\ \vdots & U_j \\ 0 & \end{pmatrix} \begin{pmatrix} l_{jj} & l_{i_1 j} \;\cdots\; l_{i_r j} \\ 0 & \\ \vdots & I \\ 0 & \end{pmatrix}$$
5.3.1 SuperLU-dist
SuperLU-dist solves Ax = b using approximate LU factors together with an iterative refinement loop. This simple scheme eliminates the need for pivoting during the factorization, which helps parallel implementations, because pivoting imposes a high communication overhead. The basic SuperLU-dist iteration is as follows:
Algorithm: SuperLU-dist
1. r = b − A ∗ x
2. backerr = maxᵢ ( |rᵢ| / (|A| ∗ |x| + |b|)ᵢ )
3. if (backerr < ε) or (backerr > lasterr/2) then stop
4. solve L ∗ U ∗ dx = r
5. x = x + dx
6. lasterr = backerr
7. loop to step 1
In this algorithm, x, L, and U are approximate while r is exact. This procedure usually converges to a reasonable solution after only 0-3 iterations, and the error is on the order of 10⁻ⁿ after n iterations.
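The loop above is easy to mimic with a dense factorization standing in for the distributed sparse one. In the sketch below, scipy.linalg's LU is used merely as a stand-in for SuperLU-dist's approximate factors, and the steps are numbered to match.

import numpy as np
import scipy.linalg as la

def refine(A, b, tol=1e-14, maxit=5):
    """Iterative refinement: factor A once, then repeatedly solve for a
    correction dx computed from the exact residual."""
    lu, piv = la.lu_factor(A)                 # approximate factors L, U
    x = la.lu_solve((lu, piv), b)
    lasterr = np.inf
    for _ in range(maxit):
        r = b - A @ x                                                  # step 1
        backerr = np.max(np.abs(r) / (np.abs(A) @ np.abs(x) + np.abs(b)))  # step 2
        if backerr < tol or backerr > lasterr / 2:                     # step 3
            break
        dx = la.lu_solve((lu, piv), r)                                 # step 4
        x = x + dx                                                     # step 5
        lasterr = backerr                                              # step 6
    return x

A = np.random.rand(100, 100) + 100 * np.eye(100)
b = np.random.rand(100)
x = refine(A, b)
print(np.linalg.norm(A @ x - b))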
The Jacobi iteration, obtained by splitting A into its diagonal, strictly lower triangular, and strictly upper triangular parts as A = D − L − U, is

xᵢ = D⁻¹(L + U) xᵢ₋₁ + D⁻¹ b.
This method presents some nice computational features. The inverse term involves only the
diagonal matrix, D. The computational cost of computing this inverse is minimal. Additionally,
this may be carried out easily in parallel since each entry in the inverse does not depend on any
other entry.
The Gauss-Seidel method, which inverts D − L instead, is often stable in practice but is less easy to parallelize: the inverse term is now a lower triangular matrix, which presents a bottleneck for parallel operations.
This method offers some practical improvements over the Jacobi method. Consider the computation of the jth element of the solution vector xᵢ at the ith iteration. The lower triangular nature of the inverse term means that only elements (j + 1) through n of the previous iterate xᵢ₋₁ are used; the remaining elements come from the current iterate xᵢ, information that was not yet available when xᵢ₋₁ was computed. In essence, this method always updates using the most recent information.
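The sketch below (a plain dense illustration of our own; the diagonally dominant test matrix guarantees convergence) implements both splittings and also reports the spectral radius of M⁻¹N, which, as discussed below, governs the convergence rate.

import numpy as np
from scipy.linalg import solve_triangular

def splitting_iteration(A, b, x0, method="jacobi", iters=50):
    """Iterate M x_i = N x_{i-1} + b with A = M - N.
    Jacobi: M = D.  Gauss-Seidel: M = D - L (lower triangular)."""
    D = np.diag(np.diag(A))
    L = -np.tril(A, -1)                   # A = D - L - U
    M = D if method == "jacobi" else D - L
    N = M - A
    x = x0.copy()
    for _ in range(iters):
        if method == "jacobi":
            x = (N @ x + b) / np.diag(A)                   # D^{-1}((L+U)x + b)
        else:
            x = solve_triangular(M, N @ x + b, lower=True)
    rho = max(abs(np.linalg.eigvals(np.linalg.solve(M, N))))
    return x, rho                          # rho = spectral radius of M^{-1}N

n = 50
A = np.random.rand(n, n); A = A + A.T + 2 * n * np.eye(n)   # diagonally dominant SPD
b = np.random.rand(n)
for m in ("jacobi", "gauss-seidel"):
    x, rho = splitting_iteration(A, b, np.zeros(n), m)
    print(m, "spectral radius", rho, "residual", np.linalg.norm(A @ x - b))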
More generally, a splitting A = M − N gives the iteration

M xᵢ = N xᵢ₋₁ + b.

The practical considerations are:
• the cost of computing (or applying) M⁻¹
• stability and convergence rate
It is interesting to analyze the convergence properties of these methods. Consider the definitions of the absolute error, eᵢ = x* − xᵢ, and the residual, rᵢ = b − Axᵢ. An iteration of the above algorithm yields the following.
x1 = M −1 N x0 + M −1 b
= M −1 (M − A)x0 + M −1 b
= x0 + M −1 r0
This shows that the convergence of the algorithm is in some way improved if the M −1 term
approximates A−1 with some accuracy. Consider the amount of change in the absolute error after
this iteration.
e1 = A−1 r0 − M −1 r0
= e0 − M −1 Ae0
= M −1 N e0
Evaluating this change over successive iterations shows how the error propagates:

eᵢ = (M⁻¹N)ⁱ e₀.
This relationship shows a bound on the error convergence. The largest eigenvalue in magnitude, i.e. the spectral radius of the matrix M⁻¹N, determines the rate of convergence of these methods. This analysis is similar to the solution of a general difference equation of the form xₖ = Axₖ₋₁. In either case, the spectral radius of the matrix must be less than 1 to ensure stability. The method will converge faster if all the eigenvalues are clustered near the origin.
This method can easily be extended to include more colors. A common practice is to choose the colors so that no node has a neighbor of the same color. It is desirable in such cases to minimize the number of colors so as to reduce the number of iteration steps.
Theorem 5.5.1 Suppose the condition number of the symmetric positive definite matrix A is κ(A) = λmax(A)/λmin(A). For any x₀, if x* is the solution to Ax = b, then the iterates of the conjugate gradient method satisfy

||x* − xₘ||_A ≤ 2 ||x* − x₀||_A ( (√κ − 1)/(√κ + 1) )ᵐ,

where ||v||_A = (vᵀAv)^{1/2}.

Therefore, ||eₘ||_A ≤ 2 ||e₀||_A · ( (√κ − 1)/(√κ + 1) )ᵐ.
Another higher-order iterative method is the Chebyshev iterative method. We refer interested readers to the book by Owe Axelsson (Iterative Solution Methods, Cambridge University Press). The conjugate gradient method is a special Krylov subspace method; other examples of Krylov subspace methods are GMRES (Generalized Minimal Residual) and Lanczos methods.
5.6 Preconditioning
Preconditioning is important in reducing the number of iterations needed to converge in many iterative methods; put more strongly, preconditioning is often what makes iterative methods possible in practice. Given a linear system Ax = b, a (parallel) preconditioner is an invertible matrix C satisfying the following:
1. The inverse C⁻¹ is relatively easy to compute. More precisely, after preprocessing the matrix C, solving the linear system Cy = b′ is much easier than solving the system Ax = b. Further, there are fast parallel solvers for Cy = b′.
2. Iterative methods for solving the system C⁻¹Ax = C⁻¹b, such as conjugate gradient,¹ should converge much more quickly than they would for the system Ax = b.
Generally, a preconditioner is intended to reduce κ(A).
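As a small illustration, the SciPy sketch below runs conjugate gradient on an SPD test matrix with and without the simplest possible preconditioner, diagonal (Jacobi) scaling. The test matrix and the choice of preconditioner are illustrative, not a recommendation.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

# An SPD test matrix whose diagonal varies over several orders of magnitude,
# so even diagonal scaling reduces the condition number substantially.
n = 1000
lap = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
A = (lap + sp.diags(np.logspace(0, 4, n))).tocsr()
b = np.random.rand(n)

dinv = 1.0 / A.diagonal()
C_inv = LinearOperator(A.shape, matvec=lambda r: dinv * r)   # apply C^{-1} to a vector

def run(M=None):
    it = [0]
    def cb(xk): it[0] += 1
    x, info = cg(A, b, M=M, callback=cb, maxiter=5000)
    return it[0], np.linalg.norm(A @ x - b)

print("no preconditioner:     iterations, residual =", run())
print("Jacobi preconditioner: iterations, residual =", run(C_inv))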
Now the question is: how to choose a preconditioner C? There is no definite answer to this.
We list some of the popularly used preconditioning methods.
• The basic splitting matrix method and SOR can be viewed as preconditioning methods.
• Incomplete factorization preconditioning: the basic idea is to first choose a good “sparsity pattern” and perform factorization by Gaussian elimination, rejecting those fill-in entries that are either small enough (relative to the diagonal entries) or in positions outside the sparsity pattern. In other words, we perform an approximate factorization L∗U∗ and use this product as a preconditioner. One effective variant is to perform block incomplete factorization to obtain a preconditioner.
The incomplete factorization methods are often effective when tuned for a particular application. They suffer, however, from being highly problem-dependent, and the condition number usually improves by only a constant factor.
¹ In general the matrix C⁻¹A is not symmetric. Thus the formal analysis uses the matrix LALᵀ, where C⁻¹ = LLᵀ [?].
Figure 5.8: Example conversion of the graph of matrix A (G(A)) to a subgraph (G(B))
• Subgraph preconditioning: The basic idea is to choose a subgraph of the graph defined
by the matrix of the linear system so that the linear system defined by the subgraph can be
solved efficiently and the edges of the original graph can be embedded in the subgraph with
small congestion and dilation, which implies small condition number of the preconditioned
matrix. In other words, the subgraph can “support” the original graph. An example of
converting a graph to a subgraph is shown in Figure 5.8.
The subgraph can be factored in O(n) space and time and applying the preconditioner takes
O(n) time per iteration.
• Block diagonal preconditioning: The observation behind this method is that in many applications the matrix can be naturally partitioned into 2 × 2 block form

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$

where the linear system defined by A₁₁ can be solved more efficiently. Block diagonal preconditioning chooses a preconditioner of the form

$$C = \begin{pmatrix} B_{11} & 0 \\ 0 & B_{22} \end{pmatrix}$$
with the condition that B11 and B22 are symmetric and
Block diagonal preconditioning methods are often used in conjunction with domain decom-
position technique. We can generalize the 2-block formula to multi-blocks, which correspond
to multi-region partition in the domain decomposition.
• Sparse approximate inverses: A sparse approximate inverse B⁻¹ of A can be computed such that B⁻¹ ≈ A⁻¹. This inverse is computed explicitly, and the quantity ||B⁻¹A − I||_F is minimized in parallel (by columns). The resulting B⁻¹ can then be used as a preconditioner. This method has the advantage of being very parallel, but suffers from poor effectiveness in some situations.
1. The inner loop (over rows) has no indirect addressing. (Sparse Level 1 BLAS is replaced by
dense Level 1 BLAS.)
2. The outer loop (over columns in the supernode) can be unrolled to save memory references.
(Level 1 BLAS is replaced by Level 2 BLAS.)
Supernodes as the destination of updates help because of the following:
3. Elements of the source supernode can be reused in multiple columns of the destination su-
pernode to reduce cache misses. (Level 2 BLAS is replaced by Level 3 BLAS.)
Supernodes in sparse Cholesky can be determined during symbolic factorization, before the
numeric factorization begins. However, in sparse LU, the nonzero structure cannot be predicted
before numeric factorization, so we must identify supernodes on the fly. Furthermore, since the
factors L and U are no longer transposes of each other, we must generalize the definition of a
supernode.
T1 Same row and column structures: A supernode is a range (r : s) of columns of L and rows of
U , such that the diagonal block F (r : s, r : s) is full, and outside that block all the columns
of L in the range have the same structure and all the rows of U in the range have the same
structure. T1 supernodes make it possible to do sup-sup updates, realizing all three benefits.
Figure 5.10: A sample matrix and its LU factors. Diagonal elements a55 and a88 are zero.
Figure 5.11: Supernodal structure, by definition T2, of the factors of the sample matrix.
Supernode-Column Updates
Figure 5.12 sketches the sup-col algorithm. The only difference from the col-col algorithm is that
all the updates to a column from a single supernode are done together. Consider a supernode (r : s)
that updates column j. The coefficients of the updates are the values from a segment of column j
of U , namely U (r : s, j). The nonzero structure of such a segment is particularly simple: all the
nonzeros are contiguous, and follow all the zeros. Thus, if k is the index of the first nonzero row
in U (r : s, j), the updates to column j from supernode (r : s) come from columns k through s.
Since the supernode is stored as a dense matrix, these updates can be performed by a dense lower
triangular solve (with the matrix L(k : s, k : s)) and a dense matrix-vector multiplication (with the
matrix L(s + 1 : n, k : s)). The symbolic phase determines the value of k, that is, the position of
the first nonzero in the segment U (r : s, j).
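A schematic dense analogue of one sup-col update is sketched below; real supernodal codes work on compressed sparse columns, and the array layout and names here are our own simplification. The segment of U is obtained by a dense triangular solve and the remainder of the destination column by a dense matrix-vector product.

import numpy as np
from scipy.linalg import solve_triangular

def supcol_update(Lsup, col, k):
    """Apply the updates from one supernode (stored as the dense block Lsup,
    the rows r..n of columns r..s of L) to one destination column.  `col`
    holds the current values of column j below row r; `k` is the offset of
    its first nonzero inside the supernode."""
    w = Lsup.shape[1]                     # supernode width (s - r + 1)
    # dense triangular solve with L(k:s, k:s)  -- a Level 2 BLAS kernel
    u = solve_triangular(Lsup[k:w, k:w], col[k:w], lower=True, unit_diagonal=True)
    col = col.copy()
    col[k:w] = u                          # computed segment of U(r:s, j)
    # dense matrix-vector product with L(s+1:n, k:s)
    col[w:] -= Lsup[w:, k:w] @ u
    return u, col

# tiny illustration: a 3-column supernode updating one column, first nonzero at k = 1
rng = np.random.default_rng(0)
Lsup = np.tril(rng.random((7, 3)))
np.fill_diagonal(Lsup, 1.0)               # unit diagonal inside the supernode
col = rng.random(7); col[0] = 0.0         # U(r, j) = 0, so the segment starts at k = 1
u, col = supcol_update(Lsup, col, k=1)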
The advantages of using sup-col updates are similar to those in the symmetric case. Efficient
Level 2 BLAS matrix-vector kernels can be used for the triangular solve and matrix-vector multiply.
Furthermore, all the updates from the supernodal columns can be collected in a dense vector before
doing a single scatter into the target vector. This reduces the amount of indirect addressing.
Supernode-Panel Updates
We can improve the sup-col algorithm further on machines with a memory hierarchy by changing
the data access pattern. The data we are accessing in the inner loop (lines 5-9 of Figure 5.12)
include the destination column j and all the updating supernodes (r : s) to the left of column j.
Column j is accessed many times, while each supernode (r : s) is used only once. In practice,
the number of nonzero elements in column j is much less than that in the updating supernodes.
Therefore, the access pattern given by this loop provides little opportunity to reuse cached data.
In particular, the same supernode (r : s) may be needed to update both columns j and j + 1.
But when we factor the (j+1)th column (in the next iteration of the outer loop), we will have to
fetch supernode (r : s) again from memory, instead of from cache (unless the supernodes are small
compared to the cache).
Panels
To exploit memory locality, we factor several columns (say w of them) at a time in the outer loop,
so that one updating supernode (r : s) can be used to update as many of the w columns as possible.
We refer to these w consecutive columns as a panel, to differentiate them from a supernode: the row structures of these columns may not be correlated in any fashion, and the boundaries between panels may be different from those between supernodes. The new method requires rewriting the
doubly nested loop as the triple loop shown in Figure 5.13.
The structure of each sup-col update is the same as in the sup-col algorithm. For each supernode (r : s) to the left of column j, if u_kj ≠ 0 for some r ≤ k ≤ s, then u_ij ≠ 0 for all k ≤ i ≤ s. Therefore, the nonzero structure of the panel of U consists of dense column segments that are row-wise separated by supernodal boundaries, as in Figure 5.13. Thus, it is sufficient for the symbolic factorization algorithm to record only the first nonzero position of each column segment.
As detailed in section 4.4, symbolic factorization is applied to all the columns in a panel at once,
over all the updating supernodes, before the numeric factorization step.
In dense factorization, the entire supernode-panel update in lines 3-7 of Figure 5.13 would
be implemented as two Level 3 BLAS calls: a dense triangular solve with w right-hand sides,
followed by a dense matrix-matrix multiply. In the sparse case, this is not possible, because the
different sup-col updates begin at different positions k within the supernode, and the submatrix
U (r : s, j : j + w − 1) is not dense. Thus the sparse supernode-panel algorithm still calls the Level 2 BLAS. However, we get similar cache benefits to those from the Level 3 BLAS, at the cost of doing the loop reorganization ourselves. Thus we sometimes call the kernel of this algorithm a “BLAS-2½” method.
In the doubly nested loop (lines 3-7 of Figure 5.13), the ideal circumstance is that all w columns
in the panel require updates from supernode (r : s). Then this supernode will be used w times
before it is forced out of the cache. There is a trade-off between the value of w and the size of the
cache. For this scheme to work efficiently, we need to ensure that the nonzeros in the w columns do
not cause cache thrashing. That is, we must keep w small enough so that all the data accessed in
this doubly nested loop fit in cache. Otherwise, the cache interference between the source supernode
and the destination panel can offset the benefit of the new algorithm.
|m| denotes the number of words in m. Each message m has a source processor src(m) and a
destination processor dest(m), both elements of V .
For m ∈ M , let d(m) denote the length of the path taken by m from the source of the message
m to its destination. We assume that each message takes a certain path of links from its source to
its destination processor. Let p(m) = (`1 , `2 , . . . , `d(m) ) be the path taken by message m. For any
link ` ∈ L, let the set of messages whose paths utilize `, {m ∈ M | ` ∈ p(m)}, be denoted M (`).
The following are obviously lower bounds on the completion time of the computation. The first three bounds are computable from the set of messages M, each of which is characterized by its size and its endpoints. The last depends on knowledge of the paths p(M) taken by the messages.
1. (Average flux)

$$\frac{\sum_{m \in M} |m| \cdot d(m)}{|L|} \cdot \beta .$$

This is the total flux of data, measured in word-hops, divided by the machine's total communication bandwidth, |L|/β.
2. (Bisection) Given a partition of the processors into two disjoint sets V₀ and V₁, let sep(V₀, V₁) be the number of links joining the two parts, and define

$$flux(V_0, V_1) \equiv \sum_{\{m \in M \mid src(m) \in V_i,\; dest(m) \in V_{1-i}\}} |m| .$$

The bound is

$$\frac{flux(V_0, V_1)}{sep(V_0, V_1)} \cdot \beta .$$

This is the number of words that cross from one part of the machine to the other, divided by the bandwidth of the wires that link them.
3. (Arrivals)

$$\max_{v \in V} \sum_{src(m)=v \ \mathrm{or}\ dest(m)=v} |m| \cdot \beta_0 .$$

This is a lower bound on the communication time for the processor with the most traffic into or out of it.
4. (Edge contention)

$$\max_{\ell \in L} \sum_{m \in M(\ell)} |m| \cdot \beta .$$
This is a lower bound on the time needed by the most heavily used wire to handle all its
traffic.
Of course, the actual communication time may be greater than any of the bounds. In particular,
the communication resources (the wires in the machine) need to be scheduled. This can be done
dynamically or, when the set of messages is known in advance, statically. With detailed knowledge
of the schedule of use of the wires, better bounds can be obtained. For the purposes of analysis
of algorithms and assignment of tasks to processors, however, we have found this more realistic
approach to be unnecessarily cumbersome. We prefer to use the four bounds above, which depend
only on the integrated (i.e. time-independent) information M and, in the case of the edge-contention
bound, the paths p(M ). In fact, in the work below, we won’t assume knowledge of paths and we
won’t use the edge contention bound.
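Given a concrete set of messages, these bounds are straightforward to evaluate. The sketch below is our own helper (it omits the bisection bound, which additionally needs a partition of the processors) and computes the average-flux, arrivals, and edge-contention bounds.

from collections import defaultdict

def comm_lower_bounds(messages, num_links, beta, beta0, paths=None):
    """Lower bounds on communication time for a set of messages.
    Each message is a tuple (size_in_words, src, dest); `paths`, if given,
    maps a message index to the list of links it traverses."""
    # 1. average flux: total word-hops divided by total bandwidth |L|/beta
    if paths is not None:
        flux = sum(size * len(paths[i]) for i, (size, s, d) in enumerate(messages))
    else:
        flux = sum(size for size, s, d in messages)   # hop counts unknown: use d(m) = 1
    avg_flux = flux * beta / num_links

    # 3. arrivals: busiest endpoint
    per_node = defaultdict(int)
    for size, s, d in messages:
        per_node[s] += size
        per_node[d] += size
    arrivals = max(per_node.values()) * beta0

    # 4. edge contention: busiest link (needs the paths)
    contention = 0.0
    if paths is not None:
        per_link = defaultdict(int)
        for i, (size, s, d) in enumerate(messages):
            for link in paths[i]:
                per_link[link] += size
        contention = max(per_link.values()) * beta

    return max(avg_flux, arrivals, contention)

msgs = [(100, 0, 1), (100, 0, 2), (50, 1, 2)]
print(comm_lower_bounds(msgs, num_links=4, beta=1e-8, beta0=2e-8))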
Consider the following algorithm for computing the Cholesky factorization A = LLᵀ:

1. L := A
2. for k = 1 to N do
3.     Lkk := √Lkk
4.     for i = k + 1 to N do
5.         Lik := Lik Lkk⁻¹
6.     for j = k + 1 to N do
7.         for i = j to N do
8.             Lij := Lij − Lik Ljkᵀ
We can let the elements Lij be scalars, in which case this is the usual or “point” Cholesky algorithm. Or we can take Lij to be a block, obtained by dividing the rows into contiguous subsets and making the same decomposition of the columns, so that the diagonal blocks are square. In the block case, the computation of √Lkk (Step 3) returns the (point) Cholesky factor of the SPD block Lkk. If A is sparse (has mostly zero entries) then L will be sparse too, although less so than A. In that case, only the non-zero entries in the sparse factor L are stored, and the multiplications/divisions in lines 5 and 8 are omitted if they compute zeros.
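For reference, the point version of this loop is a few lines of NumPy; this is a sketch for checking the algebra, not an efficient implementation.

import numpy as np

def point_cholesky(A):
    """The point version of the loop above, overwriting a copy of tril(A) with L."""
    L = np.tril(A).astype(float)
    N = L.shape[0]
    for k in range(N):
        L[k, k] = np.sqrt(L[k, k])                  # step 3
        L[k+1:, k] /= L[k, k]                       # step 5
        for j in range(k + 1, N):                   # steps 6-8: trailing update
            L[j:, j] -= L[j:, k] * L[j, k]
    return L

A = np.random.rand(6, 6); A = A @ A.T + 6 * np.eye(6)
L = point_cholesky(A)
assert np.allclose(L @ L.T, A)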
Mapping columns
Assume that the columns of a dense symmetric matrix of order N are mapped to processors cyclically: column j is stored in processor map(j) ≡ j mod p. Consider communication costs on two-dimensional grid or toroidal machines. Suppose that p is a perfect square and that the machine is a √p × √p grid. Consider a mapping of the computation in which the operations in line 8 are performed by processor map(j). After performing the operations in line 5, processor map(k) must send column k to all processors {map(j) | j > k}.
Let us fix our attention on 2D grids. There are L = 2p + O(1) links. A column can be broadcast from its source to all other processors through a spanning tree of the machine, a tree of total length p reaching all the processors. Every matrix element will therefore travel over p − 1 links, so the total information flux is (1/2)N²p and the average flux bound is (1/4)N²β.
Arrivals          (1/4) N² β₀
Average flux      (1/4) N² β
Edge contention   N² β (1/pr + 1/pc)

Table 5.1: Lower bounds on communication time for dense Cholesky on a 2D grid.
Only O(N²/p) words leave any processor. If N ≫ p, processors must accept almost the whole (1/2)N² words of L as arriving columns. The bandwidth per processor is β₀, so the arrivals bound is (1/2)N²β₀ seconds. If N ≈ p the bound drops to half that, (1/4)N²β₀ seconds. We summarize these bounds for 2D grids in Table 5.1.
We can immediately conclude that this is a nonscalable distributed algorithm: we may not take p > Nφ/β and still achieve high efficiency.
Mapping blocks
Dongarra, Van de Geijn, and Walker [26] have shown that on the Intel Touchstone Delta machine (p = 528), mapping blocks is better than mapping columns in LU factorization. In such a mapping, we view the machine as a pr × pc grid and we map elements Aij and Lij to processor (mapr(i), mapc(j)). We assume cyclic mappings here: mapr(i) ≡ i mod pr, and similarly for mapc.
The analysis of the preceding section may now be done for this mapping. Results are summarized in Table 5.2. With pr and pc both O(√p), the communication time drops like O(p^(-1/2)). With this mapping, the algorithm is scalable even when β ≫ φ. Now, with p = O(N²), both the compute time and the communication lower bounds agree; they are O(N). Therefore, we remain efficient when storage per processor is O(1). (This scalable algorithm for distributed Cholesky is due to O'Leary and Stewart [72].)
of the computational load and modest efficiency. Heuristic remapping of the block rows and columns
can remove load imbalance as a cause of inefficiency.
Several researchers have obtained excellent performance using a block-oriented approach, both
on fine-grained, massively-parallel SIMD machines [23] and on coarse-grained, highly-parallel
MIMD machines [82]. A block mapping maps rectangular blocks of the sparse matrix to pro-
cessors. A 2-D mapping views the machine as a 2-D pr × pc processor grid, whose members are
denoted p(i, j). To date, the 2-D cyclic (also called torus-wrap) mapping has been used: block L ij
resides at processor p(i mod pr , j mod pc ). All blocks in a given block row are mapped to the same
row of processors, and all elements of a block column to a single processor column. Communication
volumes grow as the square root of the number of processors, versus linearly for the 1-D mapping;
2-D mappings also asymptotically reduce the critical path length. These advantages accrue even
when the underlying machine has some interconnection network whose topology is not a grid.
A 2-D cyclic mapping, however, produces significant load imbalance that severely limits achieved
efficiency. On systems (such as the Intel Paragon) with high interprocessor communication band-
width this load imbalance limits efficiency to a greater degree than communication or want of
parallelism.
An alternative, heuristic 2-D block mapping succeeds in reducing load imbalance to a point
where it is no longer the most serious bottleneck in the computation. On the Intel Paragon the
block mapping heuristic produces a roughly 20% increase in performance compared with the cyclic
mapping.
In addition, a scheduling strategy for determining the order in which available tasks are per-
formed adds another 10% improvement.
Figure 5.14: Efficiency and overall balance on the Paragon system (B = 48).
Our experiments employ a set of test matrices including two dense matrices (DENSE1024 and
DENSE2048), two 2-D grid problems (GRID150 and GRID300), two 3-D grid problems (CUBE30
and CUBE35), and four irregular sparse matrices from the Harwell-Boeing sparse matrix test set [27]. Nested dissection or minimum degree orderings are used. In all our experiments, we choose pr = pc = √P, and we use a block size of 48. All Mflops measurements presented here are computed
by dividing the operation counts of the best known sequential algorithm by parallel runtimes. Our
experiments were performed on an Intel Paragon, using hand-optimized versions of the Level-3
BLAS for almost all arithmetic.
Figure 5.15: Efficiency bounds for 2-D cyclic mapping due to row, column and diagonal imbalances (P = 64, B = 48).
We measure row balance by work_total/(pr · work_rowmax), where work_rowmax = max_r Σ_{i: RowMap[i]=r} RowWork[i].
This row balance statistic gives the best possible overall balance (and hence efficiency), obtained
only if there is perfect load balance within each processor row. It isolates load imbalance due to an
overloaded processor row caused by a poor row mapping. An analogous expression gives column
balance, and a third analogous expression gives diagonal balance. (Diagonal d is made up of the
set of processors p(i, j) for which (i − j) mod pr = d.) While these three aggregate measures of
load balance are only upper bounds on overall balance, the data we present later make it clear that
improving these three measures of balance will in general improve the overall load balance.
Figure 5.15 shows the row, column, and diagonal balances with a 2-D cyclic mapping of the benchmark matrices on 64 processors. Diagonal imbalance is the most severe, followed by row imbalance, then column imbalance.
These data can be better understood by considering dense matrices as examples (although the
following observations apply to a considerable degree to sparse matrices as well). Row imbalance
is due mainly to the fact that RowW ork[i], the amount of work associated with a row of blocks,
increases with increasing i. More precisely, since work[i, j] increases linearly with j and the number
of blocks in a row increases linearly with i, it follows that RowW ork[i] increases quadratically in i.
Thus, the processor row that receives the last block row in the matrix receives significantly more
work than the processor row immediately following it in the cyclic ordering, resulting in significant
row imbalance. Column imbalance is not nearly as severe as row imbalance. The reason, we believe,
is that while the work associated with blocks in a column increases linearly with the column number
j, the number of blocks in the column decreases linearly with j. As a result, ColW ork[j] is neither
strictly increasing nor strictly decreasing. In the experiments, row balance is indeed poorer than
column balance. Note that the reason for the row and column imbalance is not that the 2-D cyclic
mapping is an SC mapping; rather, we have significant imbalance because the mapping functions
RowM ap and ColM ap are each poorly chosen.
To better understand diagonal imbalance, one should note that blocks on the diagonal of the
matrix are mapped exclusively to processors on the main diagonal of the processor grid. Blocks
just below the diagonal are mapped exclusively to processors just below the main diagonal of the
processor grid. These diagonal and sub-diagonal blocks are among the most work-intensive blocks in
the matrix. In sparse problems, moreover, the diagonal blocks are the only ones that are guaranteed
to be dense. (For the two dense test matrices, diagonal balance is not significantly worse than row
balance.) The remarks we make about diagonal blocks and diagonal processors apply to any SC
mapping, and do not depend on the use of a cyclic function RowMap(i) = i mod pr.
original heuristic). Unfortunately, realized performance did not improve. This result indicates
that load balance is not the most important performance bottleneck once our original heuristic is
applied.
A very simple alternative approach reduces imbalance by performing cyclic row and column mappings on a processor grid whose dimensions pc and pr are relatively prime; this reduces diagonal imbalance. We tried this using 7 × 9 and 9 × 11 processor grids (using one fewer processor than for our earlier experiments with P = 64 and P = 100). The improvement in performance is somewhat lower than that achieved with our earlier remapping heuristic (17% and 18% mean improvement on 63 and 99 processors versus 20% and 24% on 64 and 100 processors). On the other hand, the mapping needn't be computed.
Parallel Machines
A parallel computer is a connected configuration of processors and memories. The choice space
available to a computer architect includes the network topology, the node processor, the address-
space organization, and the memory structure. These choices are based on the parallel computation
model, the current technology, and marketing decisions.
No matter what the pace of change, it is impossible to make intelligent decisions about parallel
computers right now without some knowledge of their architecture. For more advanced treatment of
computer architecture we recommend Kai Hwang’s Advanced Computer Architecture and Parallel
Computer Architecture by Gupta, Singh, and Culler.
One may gauge which architectures are important today from the Top500 Supercomputer¹ list published by Meuer, Strohmaier, Dongarra, and Simon. The secret is to learn to read between the lines. There are three kinds of machines on the November 2003 Top 500 list:
Vector supercomputers, Single Instruction Multiple Data (SIMD) machines, and SMPs are no longer present on the list but used to be important in previous versions.
How can one simplify (and maybe grossly oversimplify) the current situation? Perhaps by
pointing out that the world’s fastest machines are mostly clusters. Perhaps it will be helpful to the
reader to list some of the most important machines first sorted by type, and then by highest rank
in the top 500 list. We did this in 1997 and also 2003.
¹ http://www.top500.org
Remembering that a computer is a processor and memory, really a processor with cache and
memory, it makes sense to call a set of such “computers” linked by a network a multicomputer.
Figure 6.1 shows 1) a basic computer which is just a processor and memory and also 2) a fancier
computer where the processor has cache, and there is auxiliary disk memory. To the right, we
picture 3) a three processor multicomputer. The line on the right is meant to indicate the network.
These machines are sometimes called distributed memory multiprocessors. We can further
distinguish between DMM’s based on how each processor addresses memory. We call this the
private/shared memory issue:
In the first kind of machine, each processor can address only its own local memory. When a processor wants to read data from another processor's memory, the owning processor must send the data over the network to the processor that wants it. Such machines are said to have private memory. A close analog is that in my office, I can easily find information that is located on my desk (my memory) but I have to make
can easily find information that is located on my desk (my memory) but I have to make
a direct request via my telephone (ie., I must dial a phone number) to find information
that is not in my office. And, I have to hope that the other person is in his or her office,
waiting for the phone to ring. In other words, I need the co-operation of the other
active party (the person or processor) to be able to read or write information located
in another place (the office or memory).
The alternative is a machine in which every processor can directly address every in-
stance of memory. Such a machine is said to have a shared address space, and sometimes
informally shared memory, though the latter terminology is misleading as it may easily
be confused with machines where the memory is physically shared. On a shared address
space machine, each processor can load data from or store data into the memory of any
processor, without the active cooperation of that processor. When a processor requests
memory that is not local to it, a piece of hardware intervenes and fetches the data over
the network. Returning to the office analogy, it would be as if I asked to view some
information that happened to not be in my office, and some special assistant actually
dialed the phone number for me without my even knowing about it, and got a hold of
the special assistant in the other office, (and these special assistants never leave to take
a coffee break or do other work) who provided the information.
Most distributed memory machines have private addressing. One notable exception
is the Cray T3D and the Fujitsu VPP500 which have shared physical addresses.
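In program terms, private addressing means explicit, cooperative message passing. A minimal sketch using mpi4py is shown below (an assumption: mpi4py and an MPI implementation are installed; run with, e.g., mpiexec -n 2).

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(10.0)
    comm.Send(data, dest=1, tag=0)        # rank 0 must explicitly ship its data...
elif rank == 1:
    buf = np.empty(10)
    comm.Recv(buf, source=0, tag=0)       # ...and rank 1 must explicitly ask for it
    print("rank 1 received", buf)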
Clusters are built from independent computers integrated through an after-market network. The idea of providing COTS (commodity off-the-shelf) base systems to satisfy specific computational requirements evolved as a market reaction to MPPs, with the thought that the cost might be lower. Clusters were considered to have slower communications than the specialized machines, but they have caught up fast and now outperform most specialized machines.
NOWs stands for Network of Workstations. Any collection of workstations is likely to be
networked together: this is the cheapest way for many of us to work on parallel machines given
that the networks of workstations already exist where most of us work.
The first Beowulf cluster was built by Thomas Sterling and Don Becker at the Goddard Space Flight Center in Greenbelt, Maryland; it was a cluster computer consisting of 16 DX4 processors connected by Ethernet. They named this machine Beowulf (after the legendary Geatish warrior and hero of the Old English poem Beowulf). Now people use “Beowulf cluster” to denote a cluster of PCs or workstations interconnected by a private high-speed network, which is dedicated to running high-performance computing tasks. Beowulf clusters usually run a free-software operating system like Linux or FreeBSD, though Windows Beowulfs exist.
Notice that we have already used the word “shared” to refer to the shared address space possible in a distributed memory computer. Sometimes the memory hardware in a machine does not obviously belong to any one processor. We then say the memory is central, though some authors may use the word “shared.” Therefore, for us, the central/distributed distinction is one of system architecture, while the shared/private distinction mentioned already in the distributed context refers to addressing.
Microprocessor machines known as symmetric multiprocessors (SMP) are becoming typical now
as mid-sized compute servers; this seems certain to continue to be an important machine design.
On these machines each processor has its own cache while the main memory is central. There is
no one “front-end” or “master” processor, so that every processor looks like every other processor.
This is the “symmetry.” To be precise, the symmetry is that every processor has equal access to
the operating system. This means, for example, that each processor can independently prompt the
user for input, or read a file, or see what the mouse is doing.
The microprocessors in an SMP themselves have caches built right onto the chips, so these
caches act like distributed, low-latency, high-bandwidth memories, giving the system many of the
important performance characteristics of distributed memory. Therefore if one insists on being
precise, it is not all of the memory that is central, merely the main memory. Such systems are said
to have non-uniform memory access (NUMA).
A big research issue for shared memory machines is the cache coherence problem. All fast
processors today have caches. Suppose the cache can contain a copy of any memory location in
the machine. Since the caches are distributed, it is possible that P2 can overwrite the value of x
in P2’s own cache and main memory, while P1 might not see this new updated value if P1 only
Figure 6.2: A four processor SMP (B denotes the bus between the central memory and the processor's cache).
looks at its own cache. Coherent caching means that when the write to x occurs, any cached copy
of x will be tracked down by the hardware and invalidated – i.e. the copies are thrown out of their
caches. Any read of x that occurs later will have to go back to its home memory location to get
its new value. Maintenance of cache coherence is expensive for scalable shared memory machines.
Today, only the HP Convex machine has scalable, cache coherent, shared memory. Other vendors
of scalable, shared memory systems (Kendall Square Research, Evans and Sutherland, BBN) have
gone out of the business. Another, Cray, makes a machine (the Cray T3E) in which the caches can
only keep copies of local memory locations.
SMPs are often thought of as not scalable (performance peaks at a fairly small number of pro-
cessors), because as you add processors, you quickly saturate the bus connecting the memory to
the processors.
Whenever anybody has a collection of machines, it is always natural to try to hook them up
together. Therefore any arbitrary collection of computers can become one big distributed com-
puter. When all the nodes are the same, we say that we have a homogeneous arrangement. It has
recently become popular to form high speed connections between SMPs, known as Constellations
of SMPs or SMP arrays. Sometimes the nodes have different architectures creating a heterogeneous
situation. Under these circumstances, it is sometimes necessary to worry about the explicit format
of data as it is passed from machine to machine.
SIMD machines:
In the late 1980’s, there were debates over SIMD versus MIMD. (Either pronounced as SIM-
dee/MIM-dee or by reading the letters es-eye-em-dee/em-eye-em-dee.) These two acronyms coined
by Flynn in his classification of machines refer to Single Instruction Multiple Data and Mul-
tiple Instruction Multiple Data. The second two letters, “MD” for multiple data, refer to the
ability to work on more than one operand at a time. The “SI” or “MI” refer to the ability of a
processor to issue instructions of its own. Most current machines are MIMD machines. They are
built from microprocessors that are designed to issue instructions on their own. One might say that
each processor has a brain of its own and can proceed to compute anything it likes independent of
what the other processors are doing. On a SIMD machine, every processor is executing the same
instruction, an add say, but it is executing on different data.
SIMD machines need not be as rigid as they sound. For example, each processor had the ability to not store the result of an operation. This was called context: if the context was false, the result was not stored, and the processor appeared not to execute that instruction. Also, the CM-2 had the ability to do indirect addressing, meaning that the physical address used by a processor to load a value for an add, say, need not be constant over the processors.
The most important SIMD machines were the Connection Machines 1 and 2 produced by
Thinking Machines Corporation, and the MasPar MP-1 and 2. The SIMD market received a
serious blow in 1992, when TMC announced that the CM-5 would be a MIMD machine.
Now the debates are over. MIMD has won. The prevailing theory is that because of the
tremendous investment by the personal computer industry in commodity microprocessors, it will
be impossible to stay on the same steep curve of improving performance using any other proces-
sor technology. “No one will survive the attack of the killer micros!” said Eugene Brooks of the
Lawrence Livermore National Lab. He was right. The supercomputing market does not seem to
be large enough to allow vendors to build their own custom processors. And it is not realistic
or profitable to build an SIMD machine out of these microprocessors. Furthermore, MIMD is
more flexible than SIMD; there seem to be no big enough market niches to support even a single
significant vendor of SIMD machines.
A close look at the SIMD argument:
In some respects, SIMD machines are faster from the communications viewpoint. They
can communicate with minimal latency and very high bandwidth because the processors
are always in synch. The Maspar was able to do a circular shift of a distributed array,
or a broadcast, in less time than it took to do a floating point addition. So far as we
are aware, no MIMD machine in 1996 has a latency as small as the 24 µsec overhead
required for one hop in the 1988 CM-2 or the 8 µsec latency on the Maspar MP-2.
Admitting that certain applications are more suited to SIMD than others, we were
among many who thought that SIMD machines ought to be cheaper to produce in that
one need not devote so much chip real estate to the ability to issue instructions. One
would not have to replicate the program in every machine’s memory. And communi-
cation would be more efficient in SIMD machines. Pushing this theory, the potentially
fastest machines (measured in terms of raw performance if not total flexibility) should
be SIMD machines. In its day, the MP-2 was the world’s most cost-effective machine,
as measured by the NAS Parallel Benchmarks. These advantages, however, do not seem
to have been enough to overcome the relentless, amazing, and wonderful performance
gains of the “killer micros”.
Continuing with the Flynn classification (for historical purposes) Single Instruction Single
Data or SISD denotes the sequential (or Von Neumann) machines that are on most of our desktops
and in most of our living rooms. (Though most architectures show some amount of parallelism at
some level or another.) Finally, there is Multiple Instruction Single Data or MISD, a class
which seems to be without any extant member although some have tried to fit systolic arrays into
this ill-fitting suit.
There have also been hybrids; the PASM Project (at Purdue University) has investigated the
problem of running MIMD applications on SIMD hardware! There is, of course, some performance
penalty.
Vector Supercomputers:
A vector computer today is a central, shared memory MIMD machine in which every proces-
sor has some pipelined arithmetic units and has vector instructions in its repertoire. A vector
instruction is something like “add the 64 elements of vector register 1 to the 64 elements of vec-
tor register 2”, or “load the 64 elements of vector register 1 from the 64 memory locations at
addresses x, x + 10, x + 20, . . . , x + 630.” Vector instructions have two advantages: fewer instruc-
tions fetched, decoded, and issued (since one instruction accomplishes a lot of computation), and
predictable memory accesses that can be optimized for high memory bandwidth. Clearly, a single
vector processor, because it performs identical operations in a vector instruction, has some features
in common with SIMD machines. If the vector registers have p words each, then a vector processor
may be viewed as an SIMD machine with shared, central memory, having p processors.
Vector supercomputers have very fancy integrated circuit technology (bipolar ECL logic, fast but
power hungry) in the processor and the memory, giving very high performance compared with other
processor technologies; however, that gap has now eroded to the point that for most applications,
fast microprocessors are within a factor of two in performance. Vector supercomputer processors
are expensive and require unusual cooling technologies. Machines built of gallium arsenide, or using
Josephson junction technology have also been tried, and none has been able to compete success-
fully with the silicon, CMOS (complementary, metal-oxide semiconductor) technology used in the
PC and workstation microprocessors. Thus, from 1975 through the late 1980s, supercomputers
were machines that derived their speed from uniprocessor performance, gained through the use of
special hardware technologies; now supercomputer technology is the same as PC technology, and
parallelism has become the route to performance.
machines. And communication cost, whenever there is NUMA, is also a critical issue. It has been
said that the three most important issues in parallel algorithms are "locality, locality, and locality".
One factor that complicates the discussion is that a layer of software, at the operating system
level or just above it, can provide virtual shared addressing on a private address machine by using
interrupts to get the help of an owning processor when a remote processor wants to load or store
data to its memory. A different piece of software can also segregate the shared address space of a
machine into chunks, one per processor, and confine all loads and stores by a processor to its own
chunk, while using private address space mechanisms like message passing to access data in other
chunks. (As you can imagine, hybrid machines have been built, with some amount of shared and
private memory.)
The number of hops required to send a message from a processor to its most distant processor is √p and log p, respectively, for
a 2D grid and a hypercube of p processors. The node degree of a 2D grid is 4, while the degree of a
hypercube is log p. Another important criterion for the performance of a network topology is its
bisection bandwidth, which is the minimum communication capacity of a set of links whose removal
partitions the network into two equal halves. Assuming unit capacity of each direct link, 2D and
3D grids of p nodes have bisection bandwidth √p and p^{2/3} respectively, while a hypercube of p nodes
has bisection bandwidth p/2. (See FTL page 394)
There is an obvious cost / performance trade-off to make in choosing machine topology. A
hypercube is much more expensive to build than a two dimensional grid of the same size. An
important study done by Bill Dally at Caltech showed that for randomly generated message traffic,
a grid could perform better and be cheaper to build. Dally assumed that the number of data
signals per processor was fixed, and could be organized into either four “wide” channels in a grid
topology or log n “narrow” channels (in the first hypercubes, the data channels were bit-serial) in
a hypercube. The grid won, because the average utilization of the hypercube channels was too
low: the wires, probably the most critical resource in the parallel machine, were sitting idle.
Furthermore, the work on routing technology at Caltech and elsewhere in the mid 80’s resulted in a
family of hardware routers that delivered messages with very low latency even though the length of
the path involved many "hops" through the machine. The earliest multicomputers used "store
and forward" networks, in which a message sent from A through B to C was copied into and out
of the memory of the intermediate node B (and any others on the path); this caused very large
latencies that grew in proportion to the number of hops. Later routers, including those used in
today's networks, have a "virtual circuit" capability that avoids this copying and results in small
latencies.
Does topology make any real difference to the performance of parallel machines in practice?
Some may say “yes” and some may say “no”. Due to the small size (less than 512 nodes) of
most parallel machine configurations and large software overhead, it is often hard to measure the
performance of interconnection topologies at the user level.
• Cray T3E (MIMD, distributed memory, 3D torus, uses Digital Alpha microprocessors), C90
(vector), Cray YMP, from Cray Research, Eagan, Minnesota.
• Thinking Machines CM-2 (SIMD, distributed memory, almost a hypercube) and CM-5 (SIMD
and MIMD, distributed memory, Sparc processors with added vector units, fat tree) from
Thinking Machines Corporation, Cambridge, Massachusetts.
• Intel Delta, Intel Paragon (mesh structure, distributed memory, MIMD), from Intel Corpo-
rations, Beaverton, Oregon. Based on Intel i860 RISC, but new machines based on the P6.
Recently sold world’s largest computer (over 6,000 P6 processors) to the US Dept of Energy
for use in nuclear weapons stockpile simulations.
• IBM SP-1, SP2, (clusters, distributed memory, MIMD, based on IBM RS/6000 processor),
from IBM, Kingston, New York.
• MasPar, MP-2 (SIMD, small enough to sit next to a desk), by MasPar, Santa Clara, Califor-
nia.
• KSR-2 (global addressable memory, hierarchical rings, SIMD and MIMD) by Kendall Square,
Waltham, Massachusetts. Now out of business.
• NEC SX-4 (multi-processor vector, shared and distributed memory), by NEC, Japan.
• Tera MTA (MPP vector, shared memory, multithreads, 3D torus), by Tera Computer Com-
pany, Seattle, Washington. A novel architecture which uses the ability to make very fast
context switches between threads to hide latency of access to the memory.
• Meiko CS-2HA (shared memory, multistage switch network, local I/O device), by Meiko
Concord, Massachusetts and Bristol UK.
• Cray-3 (gallium arsenide integrated circuits, multiprocessor, vector) by Cray Computer Cor-
poration, Colorado Springs, Colorado. Now out of business.
• Distributed.net: uses the idle processing time of its thousands of member computers to solve
computationally intensive problems. Its computing power is now equivalent to that of "more
than 160000 PII 266MHz computers".
• Google Compute: Runs as part of the Google Toolbar within a user’s browser. Detects spare
cycles on the machine and puts them to use solving scientific problems selected by Google.
The Akamai Network consists of thousands of servers spread globally that cache web pages and route
traffic away from congested areas. This idea was originated by Tom Leighton and Danny Lewin at
MIT.
Lecture 7
FFT
7.1 FFT
The Fast Fourier Transform is perhaps the most important subroutine in scientific computing. It
has applications ranging from multiplying numbers and polynomials to image and signal processing,
time series analysis, and the solution of linear systems and PDEs. There are tons of books on the
subject, including two recent wonderful ones by Charles Van Loan and by Briggs.
The discrete Fourier transform of a vector x is y = F_n x, where F_n is the n × n matrix whose entry
is (F_n)_{jk} = e^{-2\pi i jk/n}, j, k = 0 . . . n - 1. It is nearly always a good idea to use 0-based notation (as
with the C programming language) in the context of the discrete Fourier transform. The negative
exponent corresponds to Matlab's definition. Indeed, in Matlab we obtain F_n via fn = fft(eye(n)).
A good example is
F_4 = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{pmatrix}.
Sometimes it is convenient to denote (F_n)_{jk} = \omega_n^{jk}, where \omega_n = e^{-2\pi i/n}.
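As a quick sanity check of these definitions (a Matlab sketch; the size n = 8 is an arbitrary choice), one can build F_n explicitly and compare it with fft(eye(n)):

% Build the n-by-n DFT matrix explicitly and compare with fft(eye(n)).
n  = 8;
w  = exp(-2*pi*1i/n);            % omega_n; note the factor of i in the exponent
jk = (0:n-1)' * (0:n-1);         % outer product of the two index vectors
F  = w.^jk;                      % (F_n)_{jk} = omega_n^{jk}
disp(norm(F - fft(eye(n))))      % should be at the level of roundoff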
The Fourier matrix has more interesting properties than any matrix deserves to have. It is
symmetric (but not Hermitian). It is Vandermonde (but not ill-conditioned). It is unitary except
for a scale factor ( √1n Fn is unitary). In two ways the matrix is connected to group characters: the
matrix itself is the character table of the finite cyclic group, and the eigenvectors of the matrix are
determined from the character table of a multiplicative group.
The trivial way to do the Fourier transform is to compute the matrix-vector multiply, requiring
n^2 multiplications and roughly the same number of additions. Cooley and Tukey gave the first
O(n log n) time algorithm (actually the algorithm may be found in Gauss' work) known today as
the FFT algorithm. We shall assume that n = 2^p.
The Fourier matrix has the simple property that if \Pi_n is an unshuffle operation, then
F_n \Pi_n^T = \begin{pmatrix} F_{n/2} & D_n F_{n/2} \\ F_{n/2} & -D_n F_{n/2} \end{pmatrix},    (7.1)
where D_n is the diagonal matrix diag(1, \omega_n, . . . , \omega_n^{n/2-1}).
One DFT algorithm is then simply: 1) unshuffle the vector, 2) recursively apply the FFT algorithm
to the top half and the bottom half, then 3) combine elements in the top part with corresponding
elements in the bottom part ("the butterfly") as prescribed by the matrix
\begin{pmatrix} I & D_n \\ I & -D_n \end{pmatrix}.
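A minimal recursive Matlab implementation of this unshuffle-recurse-butterfly scheme might look as follows (a sketch; the name myfft is ours, and x is assumed to be a column vector whose length is a power of 2):

function y = myfft(x)
% Recursive radix-2 FFT: unshuffle, transform the two halves, butterfly.
n = length(x);
if n == 1
    y = x;
    return
end
even = myfft(x(1:2:n-1));             % FFT of the even-indexed entries
odd  = myfft(x(2:2:n));               % FFT of the odd-indexed entries
d    = exp(-2*pi*1i*(0:n/2-1)'/n);    % diagonal of D_n: 1, w_n, ..., w_n^(n/2-1)
t    = d .* odd;                      % twiddle the odd half
y    = [even + t; even - t];          % the butterfly
end

A call such as norm(myfft(x) - fft(x)) with x = randn(16,1) should then return a value at roundoff level.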
Everybody has their favorite way to visualize the FFT algorithm. For us, the right way is to
think of the data as living on a hypercube. The algorithm is then, permute the cube, perform the
FFT on a pair of opposite faces, and then perform the butterfly, along edges across the dimension
connecting the opposite faces.
We now repeat the three steps of the recursive algorithm in index notation:
• Step 1: id−1 . . . i1 i0 → i0 id−1 . . . i1
• Step 2: i0 id−1 . . . i1 → i0 fft(id−1 . . . i1 )
• Step 3: → i¯0 fft(id−1 . . . i1 )
Here Step 1 is a data permutation, Step 2 refers to two FFTs, and Step 3 is the butterfly on
the high order bit.
In conventional notation:
y_j = (F_n x)_j = \sum_{k=0}^{n-1} \omega_n^{jk} x_k
can be cut into the even and the odd parts:
y_j = \sum_{k=0}^{m-1} \omega_n^{2jk} x_{2k} + \omega_n^j \sum_{k=0}^{m-1} \omega_n^{2jk} x_{2k+1};
since \omega_n^2 = \omega_m (where m = n/2), the two sums are just FFT(x_even) and FFT(x_odd). With this remark (see Figure 7.1),
y_j = \sum_{k=0}^{m-1} \omega_m^{jk} x_{2k} + \omega_n^j \sum_{k=0}^{m-1} \omega_m^{jk} x_{2k+1}
y_{j+m} = \sum_{k=0}^{m-1} \omega_m^{jk} x_{2k} - \omega_n^j \sum_{k=0}^{m-1} \omega_m^{jk} x_{2k+1}.
Then the algorithm keeps recurring; the entire "communication" needed for an FFT on a vector of
length 8 can be seen in Figure 7.2.
(Figure 7.1: One FFT butterfly: FFT(x_even)_j and \omega_n^j FFT(x_odd)_j are combined with a + to give y_j and with a - to give y_{j+m}.)
The number of operations for an FFT on a vector of length n equals twice the number for an FFT
on a vector of length n/2, plus n/2 operations at the top level. As the solution of this recurrence, we get that
the total number of operations is (1/2) n log n.
Now we analyze the data motion required to perform the FFT. First we assume that to each
processor one element of the vector x is assigned. Later we discuss the “real-life” case when the
number of processors is less than n and hence each processor has some subset of elements. We also
discuss how FFT is implemented on the CM-2 and the CM-5.
The FFT always goes from high order bit to low order bit, i.e., there is a fundamental asymmetry
that is not evident in the figures below. This seems to be related to the fact that one can obtain
a subgroup of the cyclic group by alternating elements, but not by taking, say, the first half of the
elements.
Figure 7.2: FFT network for 8 elements. (Such a network is not built in practice.)
(Figure 7.3: The i-th segment of the FFT network, labeling each element's block position, parity, and position inside its part; after the butterfly the i-th bit "jumps first" and determines the new block.)
To see why the FFT reverses the bit order, let us have a look at the i-th segment of the FFT
network (Figure 7.3). The input is divided into parts, and the current input (top side) consists of FFTs
of these parts. One "block" of the input consists of the same fixed output element of all the parts.
The i - 1 most significant bits of the input address determine this output element, while the least
significant bits determine the part of the original input whose transforms are at this level.
The next step of the FFT computes the Fourier transform of parts twice as large; these consist
of an "even" and an "odd" original part. Parity is determined by the i-th most significant bit.
(Figure 7.4: Left: more than one element per processor. Right: one box is a 4 × 4 matrix multiply.)
Now let us have a look at one unit of the network in Figure 7.1; the two inputs correspond to the
same even and odd parts, while the two outputs are the possible "farthest" vector elements: they
differ in the most significant bit. What happens is that the i-th bit jumps first and becomes most
significant (see Figure 7.3).
Now let us follow the data motion in the entire FFT network. Let us assume that the i-th input
element is assigned to processor i. Then after the second step a processor with binary address
i_p i_{p-1} i_{p-2} . . . i_1 i_0 has the i_{p-1} i_p i_{p-2} . . . i_1 i_0-th data; the second bit jumps first. Then the third,
fourth, . . ., p-th bits all jump first, and finally that processor has the i_0 i_1 i_2 . . . i_{p-1} i_p-th output
element.
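The bit-reversed final layout can be checked with a few lines of Matlab (a sketch; p = 3 is arbitrary, and the loop simply reassembles each processor address with its bits reversed):

% Where does the element initially held by processor i end up?
p = 3;  n = 2^p;
addr = 0:n-1;
rev  = zeros(1, n);
for i = addr
    bits = bitget(i, 1:p);                   % bits of i, least significant first
    rev(i+1) = sum(bits .* 2.^(p-1:-1:0));   % reassemble them in reversed order
end
disp([addr; rev])                            % e.g. 3 = 011 ends up at 110 = 6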
7.1.3 Exercises
1. Verify equation (7.1).
2. Just for fun, find out about the FFT through Matlab.
We are big fans of the phone command for those students who do not already have a good
physical feeling for taking Fourier transforms. This command shows (and plays if you have
a speaker!) the signal generated when pressing a touch tone telephone in the United States
and many other countries. In the old days, when a pushbutton phone was broken apart, you
could see that pressing a key depressed one lever for an entire row and another lever for an
entire column. (For example, pressing 4 would depress the lever corresponding to the second
row and the lever corresponding to the first column.)
To look at the FFT matrix, in a way, try plot(fft(eye(7))); axis('square').
3. In Matlab use the flops function to obtain a flop count for FFTs of different power-of-2
sizes. Make your input complex. Guess a flop count of the form a + bn + c log n + dn log n.
Remembering that Matlab's \ operator solves least squares problems, find a, b, c and d. Guess
whether Matlab is counting flops or using a formula.
If, as in the FFT algorithm, we assume that n = 2^p, Strassen's method for multiplying two n-by-n matrices
calls 7 multiplications of (n/2)-by-(n/2) matrices. Hence the time required for this algorithm is
O(n^{log_2 7}) = O(n^{2.8074}). Note that Strassen's idea can be improved further (of course, at the cost
of several extra additions and an impractically large constant); the current such
record is an O(n^{2.376})-time algorithm.
A final note is that, again as in the FFT implementations, we do not recurse all the way down to
2-by-2 matrices. For some sufficiently large p, we stop when we reach 2^p × 2^p blocks and use a
direct matrix multiply, which vectorizes well on the machine.
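A sketch of the recursion just described (the function name strassen and the crossover size nmin are our own choices; the matrices are assumed square with n a power of 2):

function C = strassen(A, B, nmin)
% Strassen's matrix multiply; below the crossover size nmin we fall back
% on the ordinary product, which vectorizes well.
n = size(A, 1);
if n <= nmin
    C = A * B;
    return
end
m = n/2;  i1 = 1:m;  i2 = m+1:n;
A11 = A(i1,i1); A12 = A(i1,i2); A21 = A(i2,i1); A22 = A(i2,i2);
B11 = B(i1,i1); B12 = B(i1,i2); B21 = B(i2,i1); B22 = B(i2,i2);
% The seven half-size products.
P1 = strassen(A11 + A22, B11 + B22, nmin);
P2 = strassen(A21 + A22, B11,       nmin);
P3 = strassen(A11,       B12 - B22, nmin);
P4 = strassen(A22,       B21 - B11, nmin);
P5 = strassen(A11 + A12, B22,       nmin);
P6 = strassen(A21 - A11, B11 + B12, nmin);
P7 = strassen(A12 - A22, B21 + B22, nmin);
C  = [P1 + P4 - P5 + P7,  P3 + P5;
      P2 + P4,            P1 - P2 + P3 + P6];
end

For example, strassen(randn(64), randn(64), 8) agrees with the ordinary product up to roundoff.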
• All-to-All Broadcast:
• Array Indexing or Permutation: There are two types of array indexing: the left array
indexing and the right array indexing.
Lecture 8
Domain Decomposition
Domain decomposition is a term used by at least two different communities. Literally, the words
indicate the partitioning of a region. As we will see in Chapter ?? of this book, an important
computational geometry problem is to find good ways to partition a region. This is not what we
will discuss here.
In scientific computing, domain decomposition refers to the technique of solving partial differ-
ential equations using subroutines that solve problems on subdomains. Originally, a domain was
a contiguous region in space, but the idea has generalized to include any useful subset of the dis-
cretization points. Because of this generalization, the distinction between domain decomposition
and multigrid has become increasingly blurred.
Domain decomposition is an idea that is already useful on serial computers, but it takes on a
greater importance when one considers a parallel machine with, say, a handful of very powerful
processors. In this context, domain decomposition is a parallel divide-and-conquer approach to
solving the PDE.
To guide the reader, we quickly summarize the choice space that arises in the domain decom-
position literature. As usual a domain decomposition problem starts as a continuous problem on a
region and is discretized into a finite problem on a discrete domain.
We will take as our model problem the solution of the elliptic equation ∇²u = f on a region Ω
which is the union of at least two subdomains Ω_1 and Ω_2. Here ∇² is the Laplacian operator,
defined by ∇²u = ∂²u/∂x² + ∂²u/∂y². Domain decomposition ideas tend to be best developed for elliptic
problems, but may be applied in more general settings.
(Figures: two overlapping subdomains Ω1 and Ω2 with their boundary pieces and interface normals n1, n2, and a region decomposed into subdomains Ω1 through Ω5.)
1. Geometric Issues
Overlapping or non-overlapping regions
Geometric Discretization
Finite Difference or Finite Element
Matching or non-matching grids
2. Algorithmic Issues
Algebraic Discretization
Schwarz Approaches: Additive vs. Multiplicative
Substructuring Approaches
Accelerants
Domain Decomposition as a Preconditioner
Coarse (Hierarchical/Multilevel) Domains
3. Theoretical Considerations
Figure 8.5: Incorrect solution for the non-overlapped problem. The result is not differentiable along the
boundary between the two regions.
be more conveniently discretized by covering it with polar graph paper. Under such a situation,
the grids do not match, and it becomes necessary to transfer points interior to Ω2 to the boundary
of Ω1 and vice versa. Figure 8.6 shows an example domain with non-matching grids. Normally, grid
values are interpolated where the grid lines do not match up.
Once the domain is discretized, numerical algorithms must be formulated. There is a definite line
drawn between Schwarz (overlapping) and substructuring (non-overlapping) approaches.
2. Gauss-Seidel - Values at step n are used if available, otherwise the values are used from step
n−1
Gauss-Seidel uses the iteration:
u_i^{(n+1)} = u_i^{(n)} + \frac{1}{a_{ii}} \Big[ f_i - \sum_{j<i} a_{ij} u_j^{(n+1)} - \sum_{j>i} a_{ij} u_j^{(n)} \Big]   ∀ i
3. Red-Black Ordering - If the grid is a checkerboard, solve all red points in parallel using black
values at n - 1, then solve all black points in parallel using red values at step n. For the
checkerboard, this corresponds to the pair of iterations:
u_i^{(n+1)} = u_i^{(n)} + \frac{1}{a_{ii}} \Big[ f_i - \sum_{j \neq i} a_{ij} u_j^{(n)} \Big]   ∀ i even
u_i^{(n+1)} = u_i^{(n)} + \frac{1}{a_{ii}} \Big[ f_i - \sum_{j \neq i} a_{ij} u_j^{(n+1)} \Big]   ∀ i odd
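One red-black sweep for the model problem -∇²u = f on a square grid, discretized with the standard five-point stencil, might look as follows in Matlab (a sketch; the grid size, the right-hand side, and the zero boundary values are arbitrary choices):

% One red-black sweep on an (N+2)x(N+2) grid; the boundary layer of u
% holds the Dirichlet values and only interior points are updated.
N = 32;  h = 1/(N+1);
u = zeros(N+2, N+2);                 % initial guess, boundary included
f = ones(N+2, N+2);                  % right-hand side sampled on the grid
for color = 0:1                      % 0 = "red" points, 1 = "black" points
    for i = 2:N+1
        for j = 2:N+1
            if mod(i+j, 2) == color
                u(i,j) = 0.25 * ( u(i-1,j) + u(i+1,j) ...
                                + u(i,j-1) + u(i,j+1) + h^2 * f(i,j) );
            end
        end
    end
end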
Analogous block methods may be used on a domain that is decomposed into a number of multiple
regions. Each region is thought of as an element used to solve the larger problem. This is known
as block Jacobi, or block Gauss-Seidel.
1. Block Gauss-Seidel - Solve each region in series using the boundary values at n if available.
2. Block Jacobi - Solve each region on a separate processor in parallel and use boundary values
at n − 1. (Additive scheme)
3. Block coloring scheme - Color the regions so that like colors do not touch and solve all regions
with the same color in parallel. ( Multiplicative scheme )
The block Gauss-Seidel algorithm is called a multiplicative scheme for reasons to be explained
shortly. In a corresponding manner, the block Jacobi scheme is called an additive scheme.
(Figure: the domain is divided into two subdomains d1 and d2, separated by boundary pieces b1 and b2.)
Correspondingly, the matrix A (which of course would never be written down) has a block form
indexed by d1, d2, b1, and b2; the reader should find A_{b1,b1}, etc., in this picture. To further simplify notation, we write 1 and
2 for d1 and d2, 1b and 2b for b1 and b2, and also use only a single index for a diagonal block of a
matrix (i.e. A_1 = A_{11}).
Now that we have leisurely explained our notation, we may return to the algebra. Numerical
analysts like to turn problems that may seem new into ones that they are already familiar with. By
carefully writing down the equations for the procedure that we have described so far, it is possible
to relate the classical domain decomposition method to an iteration known as Richardson iteration.
Richardson iteration solves Au = f by computing u^{k+1} = u^k + M(f - Au^k), where M is a "good"
approximation to A^{-1}. (Notice that if M = A^{-1}, the iteration converges in one step.)
(Figure: the problem domain divided into Ω1 and Ω2 with interface I.)
A_2 u_2^{k+1} + A_{2,2b} u_{2b}^{k+1/2} = f_2
Notice that values of u^{k+1/2} updated by the first equation, specifically the values on the boundary
of the second region, are used in the second equation.
With a few algebraic manipulations, we have
u_1^{k+1/2} = u_1^{k-1} + A_1^{-1} (f - A u^{k-1})_1
u_2^{k+1} = u_2^{k+1/2} + A_2^{-1} (f - A u^{k+1/2})_2
This was already obviously a Gauss-Seidel like procedure, but those of you familiar with the alge-
braic form of Gauss-Seidel might be relieved to see the form here.
A roughly equivalent block Jacobi method has the form
u_1^{k+1/2} = u_1^{k-1} + A_1^{-1} (f - A u^{k-1})_1
u_2^{k+1/2} = u_2^{k} + A_2^{-1} (f - A u^{k})_2
u^{k+1} = u^k + (A_1^{-1} + A_2^{-1})(f - A u^k),
where the operators are understood to apply to the appropriate part of the vectors. It is here that
we see that the procedure we described is a Richardson iteration with operator M = A_1^{-1} + A_2^{-1}.
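In Matlab this Richardson view of the block Jacobi (additive) scheme is only a few lines (a sketch with exact subdomain solves; the 1-D model matrix, the overlap, and the iteration count are our own choices):

% Block Jacobi / additive scheme as a Richardson iteration
% u <- u + M (f - A u) with M = A1^{-1} + A2^{-1}.
n  = 100;
A  = gallery('tridiag', n, -1, 2, -1);     % 1-D model Laplacian
f  = ones(n, 1);
d1 = 1:60;  d2 = 41:n;                     % two overlapping subdomains
u  = zeros(n, 1);
for k = 1:50
    r  = f - A*u;                          % global residual
    du = zeros(n, 1);
    du(d1) = du(d1) + A(d1,d1) \ r(d1);    % A1^{-1} applied to the residual
    du(d2) = du(d2) + A(d2,d2) \ r(d2);    % A2^{-1} applied to the residual
    u  = u + du;
end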
One of the direct methods to solve the above equation is to use LU or LDU factorization. We
will do an analogous procedure with blocks. We can rewrite A as
A = \begin{pmatrix} I & 0 & 0 \\ 0 & I & 0 \\ A_{I1} A_1^{-1} & A_{I2} A_2^{-1} & I \end{pmatrix} \begin{pmatrix} I & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & S \end{pmatrix} \begin{pmatrix} A_1 & 0 & A_{1I} \\ 0 & A_2 & A_{2I} \\ 0 & 0 & I \end{pmatrix}
where
S = A_I - A_{I1} A_1^{-1} A_{1I} - A_{I2} A_2^{-1} A_{2I}.
In the above example we had a simple 2D region with neat squares but in reality we might have
to solve on complicated 3D regions which have to be divided into tetrahedra with 2D regions at
the interfaces. The above concepts still hold.
Getting to S^{-1}:
\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ c/a & 1 \end{pmatrix} \begin{pmatrix} a & b \\ 0 & d - bc/a \end{pmatrix}
where d - bc/a is the Schur complement of d.
In block form,
\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} I & 0 \\ CA^{-1} & I \end{pmatrix} \begin{pmatrix} A & B \\ 0 & D - CA^{-1}B \end{pmatrix}
We have
S = A_I - A_{I1} A_1^{-1} A_{1I} - A_{I2} A_2^{-1} A_{2I}
Arbitrarily break A_I as
A_I = A_I^1 + A_I^2.
Think of A as
\begin{pmatrix} A_1 & 0 & A_{1I} \\ 0 & 0 & 0 \\ A_{I1} & 0 & A_I^1 \end{pmatrix} + \begin{pmatrix} 0 & 0 & 0 \\ 0 & A_2 & A_{2I} \\ 0 & A_{I2} & A_I^2 \end{pmatrix}
and
S = S_1 + S_2.
A_1^{-1} : Poisson solve on Ω1
A_2^{-1} : Poisson solve on Ω2
A_{I1} : Ω1 → I
A_{I2} : Ω2 → I
A_{1I} : I → Ω1
A_{2I} : I → Ω2
Multiplying by the Schur complement, Sv, involves two Poisson solves and some cheap transferring.
S^{-1}v should be computed using Krylov methods. People have recommended the use of S_1^{-1} or S_2^{-1}
or (S_1^{-1} + S_2^{-1}) as a preconditioner.
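A small sketch of the resulting substructuring solve on a 1-D model problem (the splitting into two interiors i1, i2 and a single interface node bI is our own illustrative choice):

% Solve A*u = f by eliminating the two interiors and solving the
% interface (Schur complement) system S*uI = g.
n  = 101;
A  = gallery('tridiag', n, -1, 2, -1);
f  = ones(n, 1);
i1 = 1:50;  i2 = 52:n;  bI = 51;           % interiors and interface
A1 = A(i1,i1);  A2 = A(i2,i2);  AI = A(bI,bI);
S  = AI - A(bI,i1)*(A1 \ A(i1,bI)) - A(bI,i2)*(A2 \ A(i2,bI));
g  = f(bI) - A(bI,i1)*(A1 \ f(i1)) - A(bI,i2)*(A2 \ f(i2));
uI = S \ g;                                % interface values
u1 = A1 \ (f(i1) - A(i1,bI)*uI);           % back-substitute on Omega_1
u2 = A2 \ (f(i2) - A(i2,bI)*uI);           % back-substitute on Omega_2
u  = zeros(n,1);  u(i1) = u1;  u(bI) = uI;  u(i2) = u2;
disp(norm(A*u - f))                        % residual, at roundoff level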
8.2.4 Accelerants
Domain Decomposition as a Preconditioner
It seems wasteful to solve subproblems extremely accurately during the early stages of the algorithm
when the boundary data is likely to be fairly inaccurate. Therefore it makes sense to run a few
steps of an iterative solver as a preconditioner for the solution to the entire problem.
In a modern approach to the solution of the entire problem, a step or two of block Jacobi
would be used as a preconditioner in a Krylov based scheme. It is important at this point not to
lose track of what operations may take place at each level. To solve the subdomain problems, one
might use multigrid, FFT, or preconditioned conjugate gradient, but one may choose to do this
approximately during the early iterations. The solution of the subdomain problems itself may serve
as a preconditioner to the solution of the global problem which may be solved using some Krylov
based scheme.
The modern approach is to use a step of block Jacobi or block Gauss-Seidel as a preconditioner
for use in a Krylov space based subsolver. There is not too much point in solving the subproblems
exactly on the smaller domains (since the boundary data is wrong); an approximate solution
suffices. This is the idea of domain decomposition preconditioning.
Krylov Methods - Methods to solve linear systems Au = g. Examples have names such
as the Conjugate Gradient Method, GMRES (Generalized Minimum Residual), BCG (Bi-Conjugate
Gradient), QMR (Quasi Minimum Residual), and CGS (Conjugate Gradient Squared). For this
lecture, one can think of these methods as a black box. What is needed is a subroutine
that, given u, computes Au. This is a matrix-vector multiply in the abstract sense, but of course
it is not a dense matrix-vector product in the sense one practices in undergraduate linear algebra.
The other needed ingredient is a subroutine to approximately solve the system; this is known as a
preconditioner. To be useful, this subroutine must solve the problem roughly, but quickly.
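For instance, Matlab's pcg accepts both the matrix-vector product and the preconditioner as function handles, which is exactly this black-box view (a sketch; the model matrix and the one-step additive block Jacobi preconditioner below are our own illustrative choices):

% Conjugate gradients with the operator and the preconditioner supplied
% as black boxes: pcg only needs y = A*x and a routine applying M^{-1}.
n  = 200;
A  = gallery('tridiag', n, -1, 2, -1);
g  = ones(n, 1);
d1 = 1:120;  d2 = 101:n;                    % two overlapping subdomains
afun = @(x) A*x;                            % the matrix-vector multiply
mfun = @(r) accumarray([d1'; d2'], ...      % approximate solve: add the two
        [A(d1,d1)\r(d1); A(d2,d2)\r(d2)], [n 1]);   % subdomain solves
u = pcg(afun, g, 1e-8, 200, mfun);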
These modern approaches are designed to greatly speed convergence by solving the problem on dif-
ferent sized grids with the goal of communicating information between subdomains more efficiently.
Here the "domain" is a coarse grid. Mathematically, it is as easy to consider a contiguous domain
consisting of neighboring points as it is to consider a coarse grid covering the whole region.
Up until now, we saw that subdividing a problem did not directly yield the final answer, rather it
simplified or allowed us to change our approach in tackling the resulting subproblems with existing
methods. It still required that individual subregions be composited at each level of refinement to
establish valid conditions at the interface of shared boundaries.
Multilevel approaches solve the problem using a coarse grid over each sub-region, gradually
accommodating higher resolution grids as results on shared boundaries become available. Ideally
for a well balanced multi-level method, no more work is performed at each level of the hierarchy
than is appropriate for the accuracy at hand.
In general a hierarchical or multi-level method is built from an understanding of the difference
between the damping of low frequency and high frequency components of the error. Roughly speaking, one
can kill off low frequency components of the error on the coarse grid, and higher frequency errors
on the fine grid.
Perhaps this is akin to the Fast Multipole Method where p poles that are “well-separated” from
a given point could be considered as clusters, and those nearby are evaluated more precisely on a
finer grid.
In our file neighbor.data which you can take from ~edelman/summer94/friday we have encoded
information about neighbors and connections. You will see numbers such as
1 0 0 0 0 0
4 0 0 0 0 0
0 0 0 0
0
This contains information about the 0th rectangle. The first line says that it has a neighbor 1. The
4 means that the neighbor meets the rectangle on top. (1 would be the bottom, 6 would be the
lower right.) We starred out a few entries towards the bottom. Figure out what they should be.
In the actual code (solver.f), a few lines were question marked out for the message passing.
Figure out how the code works and fill in the appropriate lines. The program may be compiled
with the makefile.
Lecture 9
Particle Methods
There is another complication that occurs when we form pairwise sums of functions. If the
expansions are multipole or Taylor expansions, we may shift to a new center that is outside the
region of convergence. The coefficients may then be meaningless. Numerically, even if we shift
towards the boundary of a region of convergence, we may well lose accuracy, especially since most
computations choose to fix the number of terms in the expansion to keep.
The fast multipole algorithm accounts for these difficulties in a fairly simple manner. Instead
of computing the sum of the functions all the way up the tree and then broadcasting back, it saves
the intermediate partial sums, summing them in only when appropriate. The figure below indicates
when this is appropriate.
(Figure: two boxes of expansion coefficients (a, a, a, a) and the functions f(z) they represent, being combined.)
9.3 Outline
• Formulation and applications
Update configuration
point. For different applications we will have different interaction forces, such as gravitational or
Coulombic forces. We could even use these methods to model spring systems, although the advanced
methods, which assume forces decreasing with distance, do not work under these conditions.
9.5 Examples
• Astrophysics: The bodies are stars or galaxies, depending on the scale of the simulation.
The interactive force is gravity.
• Plasma Physics: The basic particles are ions, electrons, etc; the force is Coulombic.
• Molecular Dynamics: Particles are atoms or clusters of atoms; the force is electrostatic.
• Fluid Dynamics: the vortex method, where particles are fluid elements (fluid blobs).
Typically, we call this class of simulation methods particle methods. In such simulations,
it is important that we choose both spatial and temporal scales carefully, in order to minimize
running time and maximize accuracy. If we choose a time scale too large, we can lose accuracy
in the simulation, and if we choose one too small, the simulations will take too long to run. A
simulation of the planets of the solar system will need a much larger timescale than a model of
charged ions. Similarly, spatial scale should be chosen to minimize running time and maximize
accuracy. For example, in applications in fluid dynamics, molecular level simulations are simply
too slow to get useful results in a reasonable period of time. Therefore, researchers use the vortex
method, where bodies represent large aggregates of smaller particles. Hockney and Eastwood's
book Computer Simulations Using Particles, McGraw-Hill (1981), explores the applications
of particle methods, although it is somewhat out of date.
We will use gravitational forces as an example to present N-body simulation algorithms. Assume
there are n bodies with masses m1 , m2 , . . . , mn , respectively, initially located at x~1 , ..., x~n ∈ <3 with
velocity v_1, ..., v_n. The gravitational force exerted on the i-th body by the j-th body is given by
F_ij = G \frac{m_i m_j}{r^2} = G \frac{m_i m_j}{|x_j - x_i|^3} (x_j - x_i),
where G is the gravitational constant. Thus the total force on the i-th body is the vector sum
of all these forces and is given by
F_i = \sum_{j \neq i} F_ij.
Let a_i = dv_i/dt be the acceleration of body i, where v_i = dx_i/dt. By Newton's second
law of motion, we have F_i = m_i a_i = m_i dv_i/dt.
In practice, we often find that using a potential function V = φ m will reduce the labor of
the calculation. First, we need to compute the potential due to the N-body system, with bodies at
positions x_1, . . . , x_n, evaluated at positions y_1, . . . , y_n.
The total potential is calculated as
V_i = \sum_{j=1, j \neq i}^{n} φ(x_i - y_j) m_j,   1 ≤ i ≤ n,
where φ is the potential due to gravity. This can also be written in matrix form, with zeros on the
diagonal and entries φ(x_i - y_j) elsewhere:
V = \begin{pmatrix} 0 & \cdots & φ(x_i - y_j) \\ \vdots & \ddots & \vdots \\ φ(x_j - y_i) & \cdots & 0 \end{pmatrix}
In R^3,
φ(x) = \frac{1}{\|x\|}.
In R^2,
φ(x) = \log \|x\|.
The update of particle velocities and positions proceeds in three steps:
1. Compute the forces F;
2. V_new = V_old + Δt · F/m;
3. X_new = X_old + Δt · V_new.
The first step is the most expensive part in terms of computational time.
From the given initial configuration, we can derive the next time step configuration using the
formulae by first finding the force, from which we can derive velocity, and then position, and then
force at the next time step.
High order numerical methods can be used here to improve the simulation. In fact, the Euler
method that uses uniform time step discretization performs poorly during the simulation when two
bodies are very close. We may need to use non-uniform discretization or a sophisticated time scale
that may vary for different regions of the N-body system.
In one region of our simulation, for instance, there might be an area where there are few bodies,
and each is moving slowly. The positions and velocities of these bodies, then, do not need to be
sampled as frequently as in other, higher activity areas, and can be determined by extrapolation.
See figure 9.3 for illustration.1
How many floating point operations (flops) does each step of the Euler method take? The
velocity update (step 1) takes 2n floating point multiplications and n additions, and the position
update (step 2) takes n multiplications and n additions. Thus, each Euler step takes 5n floating
point operations. In Big-O notation, this is an O(n) time calculation with a constant factor 5.
Notice also, each Euler step can be parallelized without communication overhead. In data
parallel style, we can express steps (1) and (2), respectively, as
V = V + ∆t(F/M )
X = X + ∆tV,
where V is the velocity array; X is the position array; F is the force array; and M is the mass
array. V, X, F, M are 3 × n arrays with each column corresponding to a particle. The operator / is
the elementwise division.
1
In figure 9.3 we see an example where we have some close clusters of bodies, and several relatively disconnected
bodies. For the purposes of the simulation, we can ignore the movement of relatively isolated bodies for short periods
of time and calculate more frames for the nearby bodies. This saves computation time and grants the simulation
more accuracy where it is most needed. In many ways these sampling techniques are a temporal analogue of the
Barnes-Hut and multipole methods discussed later.
F_ij = G \frac{m_i m_j}{r^2} = G \frac{m_i m_j}{|x_j - x_i|^3} (x_j - x_i),
Note that the step for computing F_ij takes 9 flops. It takes n flops to add up the F_ij (1 ≤ j ≤ n).
Since F_ij = -F_ji, the total number of flops needed is roughly 5n². In Big-O notation, this is an
O(n²) time computation. For large scale simulation (e.g., n = 100 million), the direct method is
impractical with today's level of computing power.
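The direct method and the Euler update together, in the data parallel style above (a Matlab sketch; the particle count, masses, and time step are arbitrary choices):

% Direct O(n^2) N-body force evaluation followed by one Euler step.
n  = 500;  G = 1;  dt = 1e-3;
X  = randn(3, n);                    % positions, one column per particle
V  = zeros(3, n);                    % velocities
m  = rand(1, n);                     % masses
M  = repmat(m, 3, 1);
F  = zeros(3, n);
for i = 1:n
    d  = X - repmat(X(:,i), 1, n);           % x_j - x_i for all j
    r3 = sum(d.^2, 1).^1.5;  r3(i) = 1;      % avoid 0/0 for j = i
    w  = G * m(i) * m ./ r3;  w(i) = 0;      % G m_i m_j / |x_j - x_i|^3
    F(:,i) = d * w.';                        % F_i = sum_j G m_i m_j (x_j - x_i)/r^3
end
V = V + dt * (F ./ M);               % step 1: update velocities
X = X + dt * V;                      % step 2: update positions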
It is clear, then, that we need more efficient algorithms. The one fact that we have to take
advantage of is that in a large system, the effects of individual distant particles on each other may
be insignificant and we may be able to disregard them without significant loss of accuracy. Instead
we will cluster these particles, and deal with them as though they were one mass. Thus, in order
to gain efficiency, we will approximate in space as we did in time by discretizing.
The force acting on a body with unit mass at position x is given by the gradient of Φ, i.e.,
F = -∇Φ(x).
So, from Φ we can calculate the force field (by numerical approximation).
The potential field Φ satisfies a Poisson equation, ∇²Φ = 4πGρ, where ρ is the mass density; discretizing
this equation on a grid yields a linear system of size equal to the number of grid points chosen. This can be solved
using methods such as FFT (fast Fourier transform), SOR (successive overrelaxation), multigrid
methods or conjugate gradient. If n bodies give a relatively uniform distribution, then we can use
a grid which has about n grid points. The solution can be fairly efficient, especially on parallel
machines. For a highly non-uniform set of bodies, hybrid methods are used: the potential
induced by nearby bodies is found by the direct method, while the potential field
induced by distant bodies is approximated by the solution of a much smaller Poisson equation discretization. More
details of these methods can be found in Hockney and Eastwood's book.
(Figure: a cluster A of m particles with radius r_1 and a cluster B of n particles with radius r_2, separated by a distance r.)
The centers of mass of the two clusters are
c_A = \frac{\sum_{i \in A} m_i x_i}{M_A},   c_B = \frac{\sum_{j \in B} m_j x_j}{M_B}.
We can approximate the force induced by bodies in B on a body of mass mx located at position
s by viewing B as a single mass MB at location cB . That is,
F(x) ≈ \frac{G m_x M_B (x - c_B)}{\|x - c_B\|^3}.
Such an approximation is second order: the relative error introduced by using the center of mass is
bounded by (max(r_1, r_2)/r)^2. In other words, if f(x) is the true force vector acting on a body at
x, then
F(x) = f(x) \left( 1 + O\!\left( \left( \frac{\max(r_1, r_2)}{r} \right)^2 \right) \right).
This way, we can find all the interaction forces between A and B in O(n + m) time. The force
calculations within each cluster are handled separately by a recursive construction. This
observation gives birth to the idea of hierarchical methods.
We can also describe the method in terms of potentials. If r is much larger than both r_1 and r_2,
i.e., A and B are "well-separated", then we can use the p-th order multipole expansion (to be given
later) to express the p-th order approximation of the potential due to all particles in B. Let Φ_B^p(x)
denote such a multipole expansion. To (approximately) compute the potential at particles in A,
we simply evaluate Φ_B^p() at each particle in A. Suppose Φ_B^p() has g(p, d) terms. Using the multipole
expansion, we reduce the number of operations to g(p, d)(|A| + |B|). The error of the multipole
expansion depends on p and the ratio max(r_1, r_2)/r. We say A and B are β-well-separated, for a
β > 2, if max(r_1, r_2)/r ≤ 1/β. As shown in [44], the error of the p-th order multipole expansion is
bounded by (1/(β - 1))^p.
9.8 Quadtree (2D) and Octtree (3D): Data Structures for Canonical Clustering
Hierarchical N-body methods use quadtree (for 2D) and octtree (for 3D) to generate a canonical
set of boxes to define clusters. The number of boxes is typically linear in the number of particles,
i.e., O(n).
Quadtrees and octtrees provide a way of hierarchically decomposing two dimensional and three
dimensional space. Consider first the one dimensional example of a straight line segment. One way
to introduce clusters is to recursively divide the line as shown in Figure 9.5.
This results in a binary tree2 .
In two dimensions, a box decomposition is used to partition the space (Figure 9.6). Note that
a box may be regarded as a “product” of two intervals. Each partition has at most one particle in
it.
2
A tree is a graph with a single root node and a number of subordinate nodes called leaves or children. In a binary
tree, every node has at most two children.
A quadtree [88] is a recursive partition of a region of the plane into axis-aligned squares. One
square, the root, covers the entire set of particles. It is often chosen to be the smallest (up to a
constant factor) square that contains all particles. A square can be divided into four child squares,
by splitting it with horizontal and vertical line segments through its center. The collection of squares
then forms a tree, with smaller squares at lower levels of the tree. The recursive decomposition
is often adaptive to the local geometry. The most commonly used termination condition is: the
division stops when a box contains less than some constant (typically m = 100) number of particles
(See Figure 9.6).
In the 2D case, the height of the tree is usually on the order of log_2 N, and the complexity
of the problem is N · O(log N).
The octtree is the three-dimensional version of the quadtree. The root is a box covering the entire set
of particles. Octtrees are constructed by recursively and adaptively dividing a box into eight child
boxes, by splitting it with hyperplanes normal to each axis through its center (see Figure 9.7).
(Figure 9.8: four boxes with masses m_1, m_2, m_3, m_4 and centers of mass c_1, c_2, c_3, c_4.)
Refer to Figure 9.6 for the two dimensional case. Treating each box as a uniform cluster, the
center of mass may be hierarchically computed. For example, consider the four boxes shown
in Figure 9.8.
The total mass of the system is
m = m_1 + m_2 + m_3 + m_4    (9.5)
and the center of mass is given by
c = \frac{m_1 c_1 + m_2 c_2 + m_3 c_3 + m_4 c_4}{m}.    (9.6)
The total time required to compute the centers of mass at all layers of the quadtree is proportional
to the number of tree nodes or the number of bodies, whichever is greater; in Big-O
notation this is O(n + v), where v is the number of vertices (tree nodes).
This result is readily extendible to the three dimensional case.
Using this approximation will lose some accuracy. For instance, in the 1D case, consider three
particles located at x = -1, 0, 1, each with strength m = 1. Considering these three particles as a
cluster, the total potential is
V(x) = \frac{1}{x} + \frac{1}{x-1} + \frac{1}{x+1}.
Expanding the above equation in powers of 1/x,
V(x) = \frac{3}{x} + \frac{2}{x^3} + \frac{2}{x^5} + \cdots .
It is seen that the high order terms are neglected by the approximation; this brings the accuracy down when x is
close to the origin.
2. Pushing the particle down the tree
Consider the case of the octtree i.e. the three dimensional case. In order to evaluate the
potential at a point ~xi , start at the top of the tree and move downwards. At each node, check
whether the corresponding box, b, is well separated with respect to ~xi (Figure 9.9).
Let the force at point ~xi due to the cluster b be denoted by F~ (i, b). This force may be
calculated using the following algorithm:
(Figure 9.9: the evaluation point x_i and a box b.)
If b is well separated from x_i, treat the whole box as a single particle located at its center of mass;
otherwise, open the box and accumulate the contributions of its children:
for k = 1 to 8
    F(x_i) = F(x_i) + F(i, child(b, k))    (9.8)
The computational complexity of pushing the particle down the tree has the upper bound
9hn, where h is the height of the tree and n is the number of particles. (Typically, for more
or less uniformly distributed particles, h = log_4 n.)
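The whole scheme fits in a short Matlab sketch (all names, the leaf size, the small softening constant, and the opening parameter theta below are our own choices, not part of the original algorithm statement; points falling exactly on a dividing plane are ignored as a measure-zero event):

function demo_barneshut
% Barnes-Hut sketch in 3-D: build an octtree over the particles, then push
% each evaluation point down the tree, opening boxes that are not well
% separated and treating well separated boxes as single particles.
n = 200;  G = 1;  theta = 0.5;
X = rand(3, n);  m = rand(1, n);
root = buildtree(X, m, [0.5; 0.5; 0.5], 1, 10);   % box center, side, leaf size
F = zeros(3, n);
for i = 1:n
    F(:, i) = G * m(i) * bhfield(X(:, i), root, theta);
end
end

function node = buildtree(X, m, c, s, maxleaf)
% Recursively subdivide the box with center c and side s into octants.
node.mass = sum(m);
node.com  = X * m' / max(node.mass, eps);   % center of mass of this box
node.side = s;
node.kids = {};
node.X = X;  node.m = m;
if size(X, 2) > maxleaf
    for k = 0:7
        sgn = 2 * bitget(k, 1:3)' - 1;              % one of the 8 octants
        ck  = c + sgn * s / 4;
        in  = all(abs(X - ck * ones(1, size(X,2))) <= s/4 + eps, 1);
        if any(in)
            node.kids{end+1} = buildtree(X(:,in), m(in), ck, s/2, maxleaf);
        end
    end
end
end

function a = bhfield(x, node, theta)
% Field sum_j m_j (x_j - x)/|x_j - x|^3 due to the particles in 'node'.
d = node.com - x;  r = norm(d);
if isempty(node.kids)                       % leaf: sum over its particles
    D  = node.X - x * ones(1, size(node.X, 2));
    R3 = (sum(D.^2, 1) + 1e-12).^1.5;       % softened; the point itself adds 0
    a  = D * (node.m ./ R3)';
elseif node.side / r < theta                % well separated: one pseudo-particle
    a = node.mass * d / r^3;
else                                        % otherwise open the box
    a = zeros(3, 1);
    for k = 1:numel(node.kids)
        a = a + bhfield(x, node.kids{k}, theta);
    end
end
end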
The potential at a point x due to the cluster B, for example, is given by the following second order
approximation:
φ(x) ≈ \frac{m_B}{\|x - c_B\|} \left( 1 + \frac{1}{δ^2} \right)   in R^3.    (9.10)
In other words, each cluster may be regarded as an individual particle when the cluster is sufficiently
far away from the evaluation point ~x.
A more advanced idea is to keep track of a higher order (Taylor expansion) approximation of
the potential function induced by a cluster. Such an idea provides a better tradeoff between time
required and numerical precision. The following sections provide the two dimensional version of
the fast multipole method developed by Greengard and Rokhlin.
The Barnes-Hut method discussed above uses the particle-cluster interaction between two well-separated
clusters. Greengard and Rokhlin showed that the cluster-cluster interaction among
well-separated clusters can further improve the hierarchical method. Suppose we have k clusters
B_1, ..., B_k that are well-separated from a cluster A. Let Φ_i^p() be the p-th order multipole expansion
of B_i. Using particle-cluster interaction to approximate the far-field potential at A, we need to
perform g(p, d)|A|(|B_1| + |B_2| + ... + |B_k|) operations. Greengard and Rokhlin [44] showed that
from Φ_i^p() we can efficiently compute a Taylor expansion Ψ_i^p() centered at the centroid of A that
approximates Φ_i^p(). Such an operation of transforming Φ_i^p() to Ψ_i^p() is called a FLIP. The cluster-cluster
interaction first flips Φ_i^p() to Ψ_i^p(); we then compute Ψ_A^p() = \sum_{i=1}^{k} Ψ_i^p() and use Ψ_A^p() to
evaluate the potential at each particle in A. This reduces the number of operations to the order of
9.10 Outline
• Introduction
• Multipole Expansion
• Taylor Expansion
9.11 Introduction
For N-body simulations, sometimes, it is easier to work with the (gravitational) potential rather
than with the force directly. The force can then be calculated as the gradient of the potential.
In two dimensions, the potential function at z_j due to the other bodies is given by
φ(z_j) = \sum_{i=1, i \neq j}^{n} q_i \log(z_j - z_i) = \sum_{i=1, i \neq j}^{n} φ_{z_i}(z_j)
with
φ_{z_i}(z) = q_i \log|z - z_i|,
where z_1, . . ., z_n are the positions of the particles and q_1, . . ., q_n their strengths. The potential
due to the bodies in the rest of the space is
φ(z) = \sum_{i=1}^{n} q_i \log(z - z_i)
(Figure: a cluster of bodies z_1, . . . , z_n with center z_c, and a faraway evaluation point z.)
which is singular at each potential body. (Note: actually the potential is Re φ(z) but we take the
complex version for simplicity.)
With the Barnes and Hut scheme in term of potentials, each cluster may be regarded as an
individual particle when the cluster is sufficiently far away from the evaluation point. The following
sections will provide the details of the fast multipole algorithm developed by Greengard and Rokhlin.
Many people are often mystified why the Green’s function is a logarithm in two dimensions,
while it is 1/r in three dimensions. Actually there is an intuitive explanation. In d dimensions
the Green's function is the integral of the force, which is proportional to 1/r^{d-1}. To understand the
1/r^{d-1}, just think that the lines of force are divided "equally" on the sphere of radius r. One might
wish to imagine a d-dimensional ball with small holes on the boundary filled with d-dimensional
water. A hose placed at the center will force water to flow out radially at the boundary in a uniform
manner. If you prefer, you can imagine 1 ohm resistors arranged in a polar coordinate manner,
perhaps with higher density as you move out in the radial direction. Consider the flow of current
out of the circle at radius r if there is one input current source at the center.
a perfectly valid representation of a function which typically converges outside a circle rather than
inside. For example, it is easy to show that
q_i \log(z - z_i) = q_i \log(z - z_c) - q_i \sum_{k=1}^{\infty} \frac{1}{k} \left( \frac{z_i - z_c}{z - z_c} \right)^k,
where z_c is any complex number. This series converges in the region |z - z_c| > |z_i - z_c|, i.e., outside
of the circle containing the singularity. The formula is particularly useful if |z - z_c| ≫ |z_i - z_c|, i.e.,
if we are far away from the singularity.
Note that
φ(z) = \sum_{i=1}^{n} q_i \log(z - z_i) = Q \log(z - z_c) + \sum_{k=1}^{\infty} \frac{a_k}{(z - z_c)^k},
where Q = \sum_{i=1}^{n} q_i and
a_k = -\sum_{i=1}^{n} \frac{q_i (z_i - z_c)^k}{k}.
When we truncate the expansion due to the consideration of computation cost, an error is
introduced into the resulting potential. Consider a p-term expansion
φ_p(z) = Q \log(z - z_c) + \sum_{k=1}^{p} \frac{a_k}{(z - z_c)^k}.
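A quick numerical check of this truncated expansion (a Matlab sketch; the cluster, the center z_c, the evaluation point, and p are arbitrary choices):

% p-term multipole expansion of phi(z) = sum_i q_i log(z - z_i) about zc.
ni = 50;  p = 10;
zi = 0.5*(rand(ni,1) - 0.5) + 0.5i*(rand(ni,1) - 0.5);   % cluster near the origin
qi = randn(ni, 1);
zc = 0;                                  % expansion center
Q  = sum(qi);
a  = zeros(p, 1);
for k = 1:p
    a(k) = -sum(qi .* (zi - zc).^k) / k;     % a_k = -sum_i q_i (z_i - zc)^k / k
end
z      = 3 + 2i;                         % faraway evaluation point
direct = sum(qi .* log(z - zi));         % direct evaluation of the potential
approx = Q*log(z - zc) + sum(a ./ ((z - zc).^((1:p).')));
disp(abs(direct - approx))               % small, and shrinks rapidly as p grows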
(Figure: faraway source particles with strengths q_1, q_2, q_3, and a cluster of evaluation points centered at z_c.)
At this moment, we are able to calculate the potential at each particle due to a cluster of faraway
bodies, through the multipole expansion.
φ_{C,z_c}(z) = \sum_{k=0}^{\infty} b_k (z - z_c)^k
Denote z - z_i = (z - z_c) - (z_i - z_c) = -(z_i - z_c)(1 - ξ). Then for z such that |z - z_c| < min(z_c, C)
(the distance from z_c to the nearest particle of C), we have |ξ| < 1 and the series φ_{C,z_c}(z) converges:
φ_{C,z_c}(z) = \sum_{C} q_i \log(-(z_i - z_c)) + \sum_{C} q_i \log(1 - ξ)
            = \sum_{C} q_i \log(-(z_i - z_c)) - \sum_{k=1}^{\infty} \Big( \sum_{C} q_i k^{-1} (z_i - z_c)^{-k} \Big) (z - z_c)^k
            = b_0 + \sum_{k=1}^{\infty} b_k (z - z_c)^k
The truncated p-term local expansion is
φ^p_{C,z_c}(z) = \sum_{k=0}^{p} b_k (z - z_c)^k.
The truncation error can be bounded in terms of A = \sum_{C} |q_i| and c = |z - z_c| / \min(z_c, C) < 1.
By now, we can also compute the local potential of the cluster through the Taylor expansion.
During the process of deriving the above expansions, it is easy to see that
b_0 = \sum_{C} q_i \log(-(z_i - z_c)),   b_k = -\sum_{C} \frac{q_i}{k (z_i - z_c)^k}   (k ≥ 1).
At this point, we have finished the basic concepts involved in the multipole algorithm. Next, we will
begin to consider some of the operations that could be performed on and between the expansions.
(Figure: a parent box with center z_{C-Parent} and its four children with centers z_{C-Child1}, . . . , z_{C-Child4}.)
Note that a'_l depends only on a_0, a_1, . . . , a_l and not on the higher coefficients. It shows that
given φ^p_{z_0} we can compute φ^p_{z_1} exactly, that is, without any further error! In other words, the operators
SHIFT and truncation commute on multipole expansions.
Similarly, we can obtain the SHIFT operation for the local Taylor expansion, by extending the
operator to the domain of local expansions, so that SHIFT(φ_{C,z_0}, z_0 ⇒ z_1) produces φ_{C,z_1}. Both
series converge for z such that |z - z_0| < min(z_0, C) and |z - z_1| < min(z_1, C).
Notice that in this case, b'_l depends also on the higher coefficients, which means knowledge of
the coefficients b_0, b_1, . . . , b_p from the truncated local expansion at z_0 does not suffice to recover the
coefficients b'_0, b'_1, . . . , b'_p at another point z_1. We do incur an error by the SHIFT operation applied
to a truncated local expansion:
\left| SHIFT(φ^p_{C,z_0}, z_0 ⇒ z_1) - φ^p_{C,z_1} \right|
  = \left| \sum_{l=0}^{\infty} \sum_{k=p+1}^{\infty} b_k (-1)^{k-l} (z_0 - z_1)^{k-l} (z - z_1)^l \right|
  \le \sum_{k=p+1}^{\infty} \left| b_k (z_1 - z_0)^k \right| \sum_{l=0}^{\infty} \left| \frac{z - z_1}{z_1 - z_0} \right|^l
  = \sum_{k=p+1}^{\infty} \left| k^{-1} \sum_{C} q_i \left( \frac{z_1 - z_0}{z_i - z_0} \right)^k \right| \sum_{l=0}^{\infty} \left| \frac{z - z_1}{z_0 - z_1} \right|^l
  \le \frac{A}{(p + 1)(1 - c)(1 - D)} c^{p+1},
where A = \sum_{C} |q_i|, c = |z_1 - z_0| / \min(z_0, C), and D = |z - z_1| / |z_0 - z_1|.
At this moment, we have obtained all the information needed to perform the SHIFT operation
for both multipole expansion and local Taylor expansion. Next, we will consider the operation
which can transform multipole expansion to local Taylor expansion.
(Figure: cells marked C (the cell itself), N (neighbor), and I (interactive), indicating which cells are FLIPped.)
The FLIP operation transforms the multipole expansion φ_{z_0}(z) to the local Taylor expansion φ_{C,z_1}(z), denoted by
FLIP(φ_{z_0}, z_0 ⇒ z_1) = φ_{C,z_1}.
For |z - z_0| > max(z_0, C) and |z - z_1| < min(z_1, C) both series converge. Note that
z - z_0 = -(z_0 - z_1)\left(1 - \frac{z - z_1}{z_0 - z_1}\right) = -(z_0 - z_1)(1 - ξ)
and assume also |ξ| < 1. Then,
φ_{z_0}(z) = a_0 \log(z - z_0) + \sum_{k=1}^{\infty} a_k (z - z_0)^{-k}
  = a_0 \log(-(z_0 - z_1)) + a_0 \log(1 - ξ) + \sum_{k=1}^{\infty} a_k (-1)^k (z_0 - z_1)^{-k} (1 - ξ)^{-k}
  = a_0 \log(-(z_0 - z_1)) + \left( -\sum_{l=1}^{\infty} a_0 l^{-1} ξ^l + \sum_{k=1}^{\infty} (-1)^k a_k (z_0 - z_1)^{-k} \sum_{l=0}^{\infty} \binom{k+l-1}{l} ξ^l \right)
  = \left( a_0 \log(-(z_0 - z_1)) + \sum_{k=1}^{\infty} (-1)^k a_k (z_0 - z_1)^{-k} \right) +
    \sum_{l=1}^{\infty} \left( -a_0 l^{-1} (z_0 - z_1)^{-l} + \sum_{k=1}^{\infty} (-1)^k \binom{k+l-1}{l} a_k (z_0 - z_1)^{-(k+l)} \right) (z - z_1)^l.
Note that FLIP does not commute with truncation since one has to know all coefficients
a0 , a1 , . . . to compute b0 , b1 , . . . , bp exactly. For more information on the error in case of truncation,
see Greengard and Rokhlin (1987).
(Figure: a cell C with its neighbor cells N, interactive cells I, and faraway cells F, shown at three successive levels of refinement.)
• FARAWAY — a faraway cell F to a cell C is defined as any cell which is neither a neighbor
nor an interactive to C
Now, we start at the top level of the tree. For each cell C, FLIP the multipole expansions of the
interactive cells and combine the resulting local Taylor expansions into one expansion series. After
all of the FLIP and COMBINE operations are done, SHIFT the local Taylor expansion from the
node at this level to its four children at the next lower level, so that the information is conserved
from parent to child. Then go down to the next lower level, where the children are. For all of
the cells at this level, the faraway field has already been accounted for (it is the interactive zone at the parent level),
so we concentrate on the interactive zone at this level. Repeat the FLIP operation on all of
the interactive cells and add the flipped multipole expansions to the Taylor expansion shifted from
the parent node. Then repeat the COMBINE and SHIFT operations as before. This entire process
continues from the top level downward until the lowest level of the tree. In the end, the remaining
interactions between particles in cells that are close enough are added in directly.
where z_1, . . ., z_n are the positions of the particles and q_1, . . ., q_n their strengths. The corresponding
multipole expansion for the cluster centered at z_c is
φ_{z_c}(z) = a_0 \log(z - z_c) + \sum_{k=1}^{\infty} \frac{a_k}{(z - z_c)^k}.
In three dimensions, the potential as well as the expansion series become much more complicated.
The 3-D potential is given as
Φ(x) = \sum_{i=1}^{n} \frac{q_i}{\|x - x_i\|},
where x = f(r, θ, φ). The corresponding multipole expansion and local Taylor expansion are
Φ_{multipole}(x) = \sum_{n=0}^{\infty} \frac{1}{r^{n+1}} \sum_{m=-n}^{n} a_n^m Y_n^m(θ, φ)
Φ_{Taylor}(x) = \sum_{n=0}^{\infty} r^n \sum_{m=-n}^{n} b_n^m Y_n^m(θ, φ)
where Y_n^m(θ, φ) is the spherical harmonic function. For a more detailed treatment of 3-D expansions,
see Nabors and White (1991).
• We balance the computational load at each processor. This is directly related to the number
of non-zero entries in its matrix block.
• We minimize the communication overhead. How many other values does a processor have to
receive? This equals the number of these values that are held at other processors.
We must come up with a proper division to reduce overhead. This corresponds to dividing
up the graph of the matrix among the processors so that there are very few crossing edges. First
assume that we have 2 processors, and we wish to partition the graph for parallel processing. As an
easy example, take a simplistic cut such as cutting the 2D regular grid of size n in half through the
middle. Let’s define the cut size as the number of edges whose endpoints are in different groups.
A good cut is one with a small cut size. In our example, the cut size would be √n. Assume that
each communication costs 10 times as much as a local arithmetic operation. Then the total parallel
cost of performing the matrix-vector product on the grid is (4n)/2 + 10√n = 2n + 10√n. In
general, for p processors, we need to partition the graph into p subgraphs.
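As a quick check of this arithmetic, here is a small MATLAB calculation; the grid size n is an arbitrary choice, and the factor of 10 is the communication-to-arithmetic cost ratio assumed above.

n     = 10000;             % number of grid points (arbitrary example size)
cut   = sqrt(n);           % edges crossing the cut through the middle
comp  = 4*n/2;             % each of the 2 processors does about 4 ops per point
comm  = 10*cut;            % each cut edge costs 10 arithmetic operations
total = comp + comm        % 2n + 10*sqrt(n)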
Various methods are used to break up a graph into parts such that the number of crossing edges
is minimized. Here we’ll examine the case where p = 2 processors, and we want to partition the
graph into 2 halves.
10.2 Separators
The following definitions will be of use in the discussion that follows. Let G = (V, E) be an
undirected graph.
• A bisection of G is a division of V into V1 and V2 such that |V1 | = |V2 |. (If |V | is odd,
then the cardinalities differ by at most 1). The cost of the bisection (V1 , V2 ) is the number
of edges connecting V1 with V2 .
• An edge separator of G is a set of edges that if removed, would break G into 2 pieces with
no edges connecting them.
• A p-way partition is a division of V into p pieces V1 ,V2 ,. . .,Vp where the sizes of the various
pieces differ by at most 1. The cost of this partition is the total number of edges crossing the
pieces.
• A vertex separator is a set C of vertices that breaks G into 3 pieces A, B, and C, where no
edges connect A and B. We also add the requirement that A and B should be of roughly
equal size.
Usually, we use edge partitioning for parallel sparse matrix-vector product and vertex parti-
tioning for the ordering in direct sparse factorization.
B. Laplacian of a Graph
The motivating physical picture is a battery of voltage VB driving a current through a network
of resistors, with

    I = VB / (effective resistance).                    (10.1)

A graph is, by definition, a collection of nodes and edges.
[Figure: a battery of VB volts connected to a network of unit resistors; the network has n nodes and m edges, with the nodes and edges numbered.]
For the network in the figure, the equations for the node voltages V1 , . . . , V6 take the form of a
6 × 6 linear system

    (degrees − adjacency) (V1 , . . . , V6 )T = external currents,            (10.2)

in which the coefficient matrix has the node degrees (2, 3, 3, 3, 3, 2) on its diagonal and a −1 in
entry (i, j) for every resistor joining nodes i and j, and the right-hand side has +1 at the node where
the battery current enters, −1 where it leaves, and 0 elsewhere.
This matrix encodes how the resistors are connected; it is called the Laplacian of the graph.
There is yet another way to obtain the Laplacian. First we set up the so-called node–edge
incidence matrix, an m × n matrix whose entries are 1, −1, or 0, depending on whether the edge
(row of the matrix) is incident to the node (column of the matrix) or not.
Figure 10.2: Partitioning of a telephone network as an example of a graph: calls within group A or within group B are local calls, while calls between the two groups are long-distance calls.
We find that MG is the 8 × 6 matrix in which row e has a +1 and a −1 in the columns of the two
endpoints of edge e (an arbitrary orientation is chosen for each edge) and zeros elsewhere. (10.6)
M_G^T M_G = ∇²_G                    (10.7)
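Identity (10.7) is easy to verify numerically. The MATLAB sketch below uses a hypothetical edge list for a 6-node, 8-edge graph (not necessarily the one drawn in the figure); for any orientation of the edges, MGT MG reproduces the degrees-minus-adjacency matrix.

% hypothetical edge list (from-node, to-node); the orientation is arbitrary
E = [1 2; 1 3; 2 3; 2 4; 3 5; 4 5; 4 6; 5 6];
n = 6;  m = size(E,1);
M = zeros(m, n);                          % node-edge incidence matrix
for e = 1:m
    M(e, E(e,1)) =  1;                    % edge e leaves this node
    M(e, E(e,2)) = -1;                    % edge e enters this node
end
A = zeros(n);                             % adjacency matrix of the same graph
A(sub2ind([n n], E(:,1), E(:,2))) = 1;
A = A + A';
L = diag(sum(A,2)) - A;                   % Laplacian: degrees minus adjacency
disp(norm(M'*M - L))                      % 0: M'*M is the Laplacian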
In general, we may need to divide a graph into more than one piece. The most commonly used
approach is to recursively apply the partitioner that divides the graph into two pieces of roughly
equal size. More systematic treatment of the partitioning problem will be covered in future lectures.
The success of the spectral method in practice has a physical interpretation. Suppose now we
have a continuous domain in place of a graph. The Laplacian of the domain is the continuous
counterpart of the Laplacian of a graph. The kth eigenfunction gives the kth mode of vibration,
and the second eigenfunction induces a cut (break) of the domain along its weakest part. (See
Figure 10.5.)
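In MATLAB the spectral heuristic is only a few lines. The sketch below builds the Laplacian of a k × k grid graph and splits the vertices at the median value of the eigenvector belonging to the second smallest eigenvalue (the Fiedler vector); the grid size is an arbitrary choice.

k = 8;                                          % grid is k-by-k (arbitrary size)
e = ones(k,1);
P = spdiags([e e], [-1 1], k, k);               % path-graph adjacency on k nodes
A = kron(speye(k), P) + kron(P, speye(k));      % adjacency of the k-by-k grid graph
L = diag(sum(A,2)) - A;                         % graph Laplacian
[V, D]   = eig(full(L));                        % small example, dense eig is fine
[~, ord] = sort(diag(D));
fiedler  = V(:, ord(2));                        % eigenvector of 2nd smallest eigenvalue
piece1 = find(fiedler >  median(fiedler));      % split at the median value
piece2 = find(fiedler <= median(fiedler));
fprintf('piece sizes: %d and %d\n', numel(piece1), numel(piece2));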
Figure 10.6: The input is a mesh with specified coordinates. Every triangle must be “well-shaped”,
which means that no angle can be too small. Remarks: The mesh is a subgraph of the intersection
graph of a set of disks, one centered at each mesh point. Because the triangles are well-shaped,
only a bounded number of disks can overlap any point, and the mesh is an “alpha-overlap graph”.
This implies that it has a good separator, which we proceed to find.
cut size O(n^{1−1/d}), where the cut size is the number of elements in the vertex separator. Note that
this bound on the cut size is of the same order as for a regular grid of the same size. Such a bound
does not hold for the spectral method in general.
The geometric partitioning method has a great deal of theory behind it, but the implementation
is relatively simple. For this reason, we will begin with an illustrative example before discussing
the method and theoretical background in more depth. To motivate the software development
aspect of this approach, we use the following figures (Figures 10.6 – 10.13) generated by a Matlab
implementation (written by Gilbert and Teng) to outline the steps for dividing a well-shaped mesh
into two pieces. The algorithm works on meshes in any dimension, but we’ll stick to two dimensions
for the purpose of visualization.
To recap, the geometric mesh partitioning algorithm can be summed up as follows (a more
precise algorithm follows):
• Perform a stereographic projection to map the points in a d-dimensional plane to the surface
of a d + 1 dimensional sphere
• Perform a conformal map to move the centerpoint to the center of the sphere
• Find a plane through the center of the sphere that approximately divides the nodes equally;
translate this plane to obtain a more even division
• Undo the conformal mapping and the stereographic mapping. This leaves a circle in the
plane.
Figure 10.7: Let’s redraw the mesh, omitting the edges for clarity.
Figure 10.8: First we project the points stereographically from the plane onto the surface of a
sphere (of one higher dimension than the mesh) tangent to the plane of the mesh. A stereographic
projection is done by drawing a line between a point A on the plane and the north pole of the
sphere, and mapping the point A to the intersection of the surface of the sphere and the line. Now
we compute a “centerpoint” for the projected points in 3-space. A centerpoint is defined such that
every plane through the centerpoint separates the input points into two roughly equal subsets.
(Actually it’s too expensive to compute a real centerpoint, so we use a fast, randomized heuristic
to find a pretty good approximation.)
Figure 10.9: Next, we conformally map the points so that the centerpoint maps to the center of
the sphere. This takes two steps: First we rotate the sphere about the origin (in 3-space) so that
the centerpoint is on the z axis, and then we scale the points in the plane to move the centerpoint
along the z axis to the origin (this can be thought of as mapping the sphere surface back to the
plane, stretching the plane, then re-mapping the plane back to the surface of the sphere). The
figures show the final result on the sphere and in the plane.
• Regular Grids: These arise, for example, from finite difference methods.
• “Quad-tree” graphs and “Meshes”: These arise, for example, from finite difference methods
and hierarchical N-body simulation.
Figure 10.10: Because the approximate centerpoint is now at the origin, any plane through the
origin should divide the points roughly evenly. Also, most planes only cut a small number of mesh
edges (O(√n), to be precise). Thus we find a separator by choosing a plane through the origin,
which induces a great circle on the sphere. In practice, several potential planes are considered, and
the best one accepted. Because we only estimated the location of the centerpoint, we must shift
the circle slightly (in the normal direction) to make the split exactly even. The second circle is the
shifted version.
Figure 10.11: We now begin “undoing” the previous steps, to return to the plane. We first undo
the conformal mapping, giving a (non-great) circle on the original sphere ...
Figure 10.12: ... and then undo the stereographic projection, giving a circle in the original plane.
Figure 10.13: This partitions the mesh into two pieces with about n/2 points each, connected by
at most O(√n) edges. These connecting edges are called an “edge separator”. This algorithm
can be used recursively if more divisions (ideally a power of 2) are desired.
• Disk packing graphs: If a set of non-overlapping disks is laid out in a plane, we can tell
which disks touch. The nodes of a disk packing graph are the centers of the disks, and edges
connect two nodes if their respective disks touch.
• Planar graphs: These are graphs that can be drawn in a plane without crossing edges. Note
that disk packing graphs are planar, and in fact every planar graph is isomorphic to some
disk-packing graph (Andreev and Thurston).
A neighborhood system and another parameter α define an overlap graph. There is a vertex for
each disk. For α = 1, an edge joins two vertices whose disks intersect. For α > 1, an edge joins two
vertices if expanding the smaller of their two disks by a factor of α would make them intersect.
Definition 10.4.2 Let α ≥ 1, and let {D1 , . . . , Dn } be a k-ply neighborhood system. The (α, k)-
overlap graph for the neighborhood system is the graph with vertex set {1, . . . , n} and edge set
    E = { (i, j) : Di ∩ (α · Dj ) ≠ ∅ and (α · Di ) ∩ Dj ≠ ∅ }.
We make an overlap graph into a mesh in d-space by locating each vertex at the center of its disk.
Overlap graphs are good models of computational meshes because every mesh of bounded-
aspect-ratio elements in two or three dimensions is contained in some overlap graph (for suitable
choices of the parameters α and k). Also, every planar graph is an overlap graph. Therefore, any
theorem about partitioning overlap graphs implies a theorem about partitioning meshes of bounded
aspect ratio and planar graphs.
We now describe the geometric partitioning algorithm.
We start with two preliminary concepts. We let Π denote the stereographic projection mapping
from IRd to S d , where S d is the unit d-sphere embedded in IRd+1 . Geometrically, this map may
be defined as follows. Given x ∈ IRd , append ‘0’ as the final coordinate yielding x0 ∈ IRd+1 . Then
compute the intersection of S d with the line in IRd+1 passing through x0 and (0, 0, . . . , 0, 1)T . This
intersection point is Π(x).
Algebraically, the mapping is defined as
\[
\Pi(x) = \begin{pmatrix} 2x/\chi \\ 1 - 2/\chi \end{pmatrix}
\]
where χ = xT x + 1. It is also simple to write down a formula for the inverse of Π. Let u be a point
on S d . Then
\[
\Pi^{-1}(u) = \frac{\bar{u}}{1 - u_{d+1}}
\]
where ū denotes the first d entries of u and ud+1 is the last entry. The stereographic mapping,
besides being easy to compute, has a number of important properties proved below.
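Both maps are one-liners in MATLAB; the following sketch (with an arbitrary test point) checks that Π lands on the unit sphere and that Π−1 undoes it.

Pi    = @(x) [2*x/(x'*x + 1); 1 - 2/(x'*x + 1)];   % R^d -> S^d embedded in R^(d+1)
PiInv = @(u) u(1:end-1) / (1 - u(end));            % S^d minus the north pole -> R^d
x = [0.3; -1.2];                                   % an arbitrary point, here d = 2
u = Pi(x);
disp(norm(u))                                      % 1: the image lies on the unit sphere
disp(norm(PiInv(u) - x))                           % 0: the inverse recovers x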
A second crucial concept for our algorithm is the notion of a center point. Given a finite subset
P ⊂ IRd such that |P | = n, a center point of P is defined to be a point x ∈ IRd such that if H is
any open halfspace whose boundary contains x, then
    |P ∩ H| ≤ d · n/(d + 1).                    (10.10)
It can be shown from Helly’s theorem [25] that a center point always exists. Note that center points
are quite different from centroids. For example, a center point (which, in the d = 1 case, is the
same as a median) is largely insensitive to “outliers” in P . On the other hand, a single distant outlier
can cause the centroid of P to be displaced by an arbitrarily large distance.
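In the d = 1 case this contrast is easy to see numerically, since a center point is just a median; the data below are made up for illustration.

x = [randn(1, 99), 1e6];                 % 99 ordinary samples plus one huge outlier
fprintf('median = %.2f, mean = %.2f\n', median(x), mean(x));
% the median stays near 0 while the mean is dragged to about 10^4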
Geometric Partitioning Algorithm
Let P = {p1 , . . . , pn } be the input points in IRd that define the overlap graph.
1. Given P , compute P ′ = Π(P ), the stereographic projection of the input points onto the unit
sphere S d ⊂ IRd+1 .
2. Compute a center point z ∈ IRd+1 of P ′ (in practice an approximate center point; see below).
3. Compute an orthogonal (d + 1) × (d + 1) matrix Q such that Qz = z ′ = (0, . . . , 0, θ)T for
some scalar θ.
4. Define P ′′ = QP ′ (i.e., apply Q to each point in P ′ ). Note that P ′′ ⊂ S d , and the center
point of P ′′ is z ′ .
5. Let D be the matrix [(1 − θ)/(1 + θ)]1/2 I, where I is the d × d identity matrix. Let P ′′′ =
Π(DΠ−1 (P ′′ )). Below we show that the origin is a center point of P ′′′ .
6. Choose a random great circle S0 on S d , i.e., the intersection of S d with a random plane
through the origin.
7. Transform S0 back to a sphere S ⊂ IRd by reversing all the transformations above, i.e.,
S = Π−1 (Q−1 Π(D−1 Π−1 (S0 ))).
8. From S compute a set of vertices of G that split the graph as in Theorem ??. In particular,
define C to be vertices embedded “near” S, define A to be vertices of G − C embedded outside
S, and define B to be vertices of G − C embedded inside S.
We can immediately make the following observation: because the origin is a center point of
P ′′′ , and the points are split by choosing a plane through the origin, we know that |A| ≤
(d + 1)n/(d + 2) and |B| ≤ (d + 1)n/(d + 2) regardless of the details of how C is chosen. (Notice
that the constant factor is (d + 1)/(d + 2) rather than d/(d + 1) because the point set P ′ lies in
IRd+1 rather than IRd .) Thus, one of the claims made in Theorem ?? will follow as soon as we have
shown, at the end of this section, that the origin is indeed a center point of P ′′′ .
We now provide additional details about the steps of the algorithm, and also its complexity
analysis. We have already defined the stereographic projection used in Step 1. Step 1 requires O(nd)
operations.
Computing a true center point in Step 2 appears to be a very expensive operation (involving a
linear programming problem with n^d constraints), but by using random (geometric) sampling, an
approximate center point can be found in random constant time (independent of n but exponential
in d) [100, 48]. An approximate center point satisfies (10.10) except with (d + 1 + ε)n/(d + 2) on the
right-hand side, where ε > 0 may be arbitrarily small. Alternatively, a deterministic linear-time
sampling algorithm can be used in place of random sampling [65, 96], but one must again compute
a center of the sample using linear programming in time exponential in d [67, 41].
In Step 3, the necessary orthogonal matrix may be represented as a single Householder
reflection—see [43] for an explanation of how to pick an orthogonal matrix to zero out all but
one entry in a vector. The number of floating point operations involved is O(d) independent of n.
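A minimal MATLAB sketch of such a reflection is shown below; the vector z stands for the (approximate) center point and is an arbitrary example.

z = [0.2; -0.4; 0.7];                         % an example center point in R^(d+1)
theta = norm(z);
en = [zeros(numel(z)-1, 1); 1];               % the last coordinate direction
u = z - theta*en;                             % Householder vector
Q = eye(numel(z)) - 2*(u*u')/(u'*u);          % orthogonal reflection with Q*z on the axis
disp(Q*z)                                     % (0, ..., 0, theta)' up to roundoff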
In Step 4 we do not actually need to compute P ′′ ; the set P ′′ is defined only for the purpose
of analysis. Thus, Step 4 does not involve computation. Note that z ′ is the center point
of P ′′ after this transformation, because when a set of points is transformed by any orthogonal
transformation, a center point moves according to the same transformation (more generally, center
points are similarly moved under any affine transformation). This is proved below.
In Step 6 we choose a random great circle, which requires time O(d). This is equivalent to
choosing a plane through the origin with a randomly selected orientation. (This step of the algorithm
can be made deterministic; see [?].) Step 7 is also seen to require time O(d).
Finally, there are two possible alternatives for carrying out Step 8. One alternative is that we
are provided with the neighborhood system of the points (i.e., a list of n balls in IR d ) as part of the
input. In this case Step 8 requires O(nd) operations, and the test to determine which points belong
in A, B or C is a simple geometric test involving S. Another possibility is that we are provided
with the nodes of the graph and a list of edges. In this case we determine which nodes belong in
A, B, or C based on scanning the adjacency list of each node, which requires time linear in the size
of the graph.
Theorem 10.4.1 If M is an unstructured mesh with bounded aspect ratio, then the graph of M is
a subgraph of a bounded overlap graph of the neighborhood system in which we place one ball at each
vertex of M with radius equal to half of the distance to its nearest neighboring vertex. Clearly, this
neighborhood system has ply equal to 1.
Theorem 10.4.1 (Geometric Separators [67]) Let G be an n-vertex (α, k)-overlap graph in d
dimensions. Then the vertices of G can be partitioned into three sets A, B, and C, such that no
edge joins A and B, A and B each contain at most (d + 1)n/(d + 2) vertices, and C contains at most
O(αk^{1/d} n^{1−1/d}) vertices.
• Trees have a 1-vertex separator with β = 2/3 (the so-called centroid of the tree).
• Planar graphs: a result of Lipton and Tarjan shows that a planar graph of bounded degree
has a √(8n)-vertex separator with β = 2/3.
• d-dimensional regular grids (those used for the basic finite difference method): it is folklore
that they have a separator of size n^{1−1/d} with β = 1/2.
Inertia-based slicing
Williams [101] noted that RCB had poor worst case performance, and suggested that it could
be improved by slicing orthogonal to the principal axes of inertia, rather than orthogonal to the
coordinate axes. Farhat and Lesoinne implemented and evaluated this heuristic for partitioning
[33].
In three dimensions, let v = (vx , vy , vz )t be the coordinates of vertex v in IR3 . Then the inertia
matrix I of the vertices of a graph with respect to the origin is given by
\[
I = \begin{pmatrix} I_{xx} & I_{xy} & I_{xz} \\ I_{yx} & I_{yy} & I_{yz} \\ I_{zx} & I_{zy} & I_{zz} \end{pmatrix}
\]
where
\[
I_{xx} = \sum_{v \in V} v_y^2 + v_z^2, \qquad I_{yy} = \sum_{v \in V} v_x^2 + v_z^2, \qquad I_{zz} = \sum_{v \in V} v_x^2 + v_y^2 ,
\]
and the off-diagonal entries are I_{xy} = I_{yx} = −Σ_{v∈V} v_x v_y , I_{xz} = I_{zx} = −Σ_{v∈V} v_x v_z , and
I_{yz} = I_{zy} = −Σ_{v∈V} v_y v_z .
The eigenvectors of the inertia matrix are the principal axes of the vertex distribution. The
eigenvalues of the inertia matrix are the principal moments of inertia. Together, the principal axes
and principal moments of inertia define the inertia ellipse; the axes of the ellipse are the principal
axes of inertia, and the axis lengths are the square roots of the corresponding principal moments.
Physically, the size of a principal moment reflects how the mass of the system is distributed with
respect to the corresponding axis: the larger the principal moment, the more mass is concentrated
at a distance from the axis.
Let I1 , I2 , and I3 denote the principal axes of inertia corresponding to the principal moments
α1 ≤ α2 ≤ α3 . Farhat and Lesoinne projected the vertex coordinates onto I1 , the axis about
which the mass of the system is most tightly clustered, and partitioned using a planar cut through
the median. This method typically yielded a good initial separator, but did not perform as well
recursively on their test mesh - a regularly structured “T”-shape.
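A minimal MATLAB sketch of this inertia-based split is given below; the random point cloud stands in for the mesh vertex coordinates, and the inertia matrix is taken about the centroid.

V = randn(1000, 3) * diag([5 2 1]);            % stand-in vertex coordinates (n-by-3)
V = V - repmat(mean(V,1), size(V,1), 1);       % center the coordinates
x = V(:,1);  y = V(:,2);  z = V(:,3);
I = [ sum(y.^2 + z.^2), -sum(x.*y),        -sum(x.*z);
     -sum(x.*y),         sum(x.^2 + z.^2), -sum(y.*z);
     -sum(x.*z),        -sum(y.*z),         sum(x.^2 + y.^2) ];
[U, D] = eig(I);
[~, k] = min(diag(D));                         % principal axis with the smallest moment
proj   = V * U(:, k);                          % project vertices onto that axis
half   = proj > median(proj);                  % planar cut through the median
fprintf('piece sizes: %d and %d\n', nnz(half), nnz(~half));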
Farhat and Lesoinne did not present any results on the theoretical properties of the inertia
slicing method. In fact, there are pathological cases in which the inertia method can be shown to
yield a very poor separator. Consider, for example, a “+”-shape in which the horizontal bar is
very wide and sparse, while the vertical bar is relatively narrow but dense. I1 will be parallel to
the horizontal axis, but a cut perpendicular to this axis through the median will yield a very large
separator. A diagonal cut will yield the smallest separator, but will not be generated with this
method.
Gremban, Miller, and Teng show how to use moment of inertia to improve the geometric par-
titioning algorithm.
• Matlab Mesh Partitioning Toolbox: written by Gilbert and Teng. It includes both edge and
vertex separators, recursive bipartition, nested dissection ordering, visualizations and demos,
and some sample meshes. The complete toolbox is available by anonymous ftp from machine
ftp.parc.xerox.com as file /pub/gilbert/meshpart.uu.
In other words, the ratio of the size of the smallest leaf-box to that of the root-box is 1/2^{log_{2^d}(n/m)+µ}. In
practice, µ is less than 100.
The Barnes-Hut algorithm, as an algorithm, can be easily generalized to the non-uniform case.
We describe a version of FMM for non-uniformly distributed particles. The method uses the box-
box interaction. FMM tries to maximize the number of FLIPs among large boxes and also tries to
FLIP between roughly equal sized boxes, a philosophy which can be described as: let parents do as
much work as possible and then do the left-over work as much as possible before passing to the next
generation. Let c1 , ..., c2d be the set of child-boxes of the root-box of the hierarchical tree. FMM
generates the set of all interaction-pairs of boxes by taking the union of Interaction-pair(c i , cj ) for
all 1 ≤ i < j ≤ 2d , using the Interaction-Pair procedure defined below.
Procedure Interaction-Pair (b1 , b2 )
• If b1 and b2 are β-well-separated, then (b1 , b2 ) is an interaction-pair.
• Else, if both b1 and b2 are leaf-boxes, then particles in b1 and b2 are near-field particles.
• Else, if both b1 and b2 are not leaf-boxes, without loss of generality, assuming that b2 is at
least as large as b1 and letting c1 , ..., c2d be the child-boxes of b2 , then recursively decide
interaction pair by calling: Interaction-Pair(b1 ,ci ) for all 1 ≤ i ≤ 2d .
• Else, if one of b1 and b2 is a leaf-box, without loss of generality, assuming that b1 is a leaf-box
and letting c1 , ..., c2d be the child-boxes of b2 , then recursively decide interaction pairs by
calling: Interaction-Pair(b1 ,ci ) for all 1 ≤ i ≤ 2d .
FMM for far-field calculation can then be defined as: for each interaction pair (b1 , b2 ), letting
Φpi () (i = 1, 2) be the multipole-expansion of bi , flip Φp1 () to b2 and add to b2 ’s potential Taylor-
expansion. Similarly, flip Φp2 () to b1 and add to b1 ’s potential Taylor-expansion. Then traverse
down the hierarchical tree in a preordering, shift and add the potential Taylor-expansion of the
parent box of a box to its own Taylor-expansion.
Note that FMM for uniformly distributed particles has a more direct description (see Chapter
??).
[Figure 10.14: the root-box is divided into four child-boxes A, B, C, and D; almost all particles lie in D.]
Suppose that boxes A, B, and C contain fewer than m (< 100) particles, and that most particles,
say n of them, are uniformly distributed in D; see Figure 10.14. In FMM, we further recursively
divide D by log4 (n/m) levels. Notice that A, B, and C are not well-separated from any box in D.
Hence the FMM described in the previous subsection will declare all particles of D as near-field
particles of A, B, and C (and vice versa). The drawback is two-fold: (1) From the computational
viewpoint, we
cannot take advantage of the hierarchical tree of D to evaluate potentials in A, B, and C. (2) From
the communication viewpoint, boxes A, B, and C have a large in-degree in the sense that each
particle in these boxes needs to receive information from all n particles in D, making partitioning
and load balancing harder. Notice that in BH most boxes of D are well-separated from particles in
A, B, and C. Hence the well-separation condition is different in BH: because BH uses the particle-
box interaction, the well-separation condition is measured with respect to the size of the boxes in
D. Thus most boxes are well-separated from particles in A, B, and C. In contrast, because FMM
applies the FLIP operation, the well-separation condition must measure up against the size of the
larger box. Hence no box in D is well-separated from A, B, and C.
Our refined FMM circumvents this problem by incorporating the well-separation condition of
BH into the Interaction-Pair procedure: if b1 and b2 are not well-separated, and b1 , the larger of
the two, is a leaf-box, then we use a well-separation condition with respect to b2 , instead of to b1 ,
and apply the FLIP operation directly onto particles in the leaf-box b1 rather than b1 itself.
We will define this new well-separation condition shortly. First, we make the following ob-
servation about the Interaction-Pair procedure defined in the last subsection. We can prove, by
a simple induction, the following fact: if b1 and b2 are an interaction-pair and both b1 and b2
are not leaf-boxes, then 1/2 ≤ size(b1 )/size(b2 ) ≤ 2. This is precisely the condition that FMM
would like to maintain. For uniformly distributed particles, such condition is always true between
any interaction-pair (even if one of them is a leaf-box). However, for non-uniformly distributed
particles, if b1 , the larger box, is a leaf-box, then b1 could be much larger than b2 .
The new β-well-separation condition, when b1 is a leaf-box, is then defined as: b1 and b2 are
β-well-separated if b2 is well-separated from all particles of b1 (as in BH). Notice, however, with the
new condition, we can no longer FLIP the multipole expansion of b1 to a Taylor-expansion for b2 .
Because b1 has only a constant number of particles, we can directly evaluate the potential induced
by these particles for b2 . This new condition makes the FLIP operation of this special class of
interaction-pairs uni-directional: We only FLIP b2 to b1 .
We can describe the refined Interaction-Pair procedure using modified well-separation condition
when one box is a leaf-box.
Procedure Refined Interaction-Pair (b1 , b2 )
• If b1 and b2 are β-well-separated and 1/2 ≤ size(b1 )/size(b2 ) ≤ 2, then (b1 , b2 ) is a
bi-directional interaction-pair.
• Else, if the larger box, without loss of generality, b1 , is a leaf-box, then the well-separation
condition becomes: b2 is well-separated from all particles of b1 . If this condition is true, then
(b1 , b2 ) is a uni-directional interaction-pair from b2 to b1 .
• Else, if both b1 and b2 are leaf-boxes, then particles in b1 and b2 are near-field particles.
• Else, if both b1 and b2 are not leaf-boxes, without loss of generality, assuming that b2 is at
least as large as b1 and letting c1 , ..., c2d be the child-boxes of b2 , then recursively decide
interaction-pairs by calling: Refined-Interaction-Pair(b1 ,ci ) for all 1 ≤ i ≤ 2d .
• Else, if one of b1 and b2 is a leaf-box, without loss of generality, assuming that b1 is a leaf-box
and letting c1 , ..., c2d be the child-boxes of b2 , then recursively decide interaction pairs
by calling: Refined-Interaction-Pair(b1 ,ci ) for all 1 ≤ i ≤ 2d .
Let c1 , ..., c2d be the set of child-boxes of the root-box of the hierarchical tree. Then the
set of all interaction-pairs can be generated as the union of Refined-Interaction-Pair(ci , cj ) for all
1 ≤ i < j ≤ 2d .
The refined FMM for far-field calculation can then be defined as: for each bi-directional inter-
action pair (b1 , b2 ), letting Φpi () (i = 1, 2) be the multipole expansion of bi , flip Φp1 () to b2 and
add to b2 ’s potential Taylor-expansion. Similarly, flip Φp2 () to b1 and add to b1 ’s potential Taylor-
expansion. Then traverse down the hierarchical tree in a preordering, shift and add the potential
Taylor-expansion of the parent box of a box to its own Taylor-expansion. For each uni-directional
interaction pair (b1 , b2 ) from b2 to b1 , letting Φp2 () be the multipole-expansion of b2 , evaluate Φp2 ()
directly at each particle in b1 and add its potential.
Lemma 10.5.1 The refined FMM flips the multipole expansion of b2 to b1 if and only if (1) b2
is well-separated from b1 and (2) neither the parent of b2 is well-separated from b1 nor b2 is well-
separated from the parent of b1 .
BH defines two classes of communication graphs: BHSβ and BHPβ . BHSβ models the sequential
communication pattern and BHPβ is more suitable for parallel implementation. The letters S and
P , in BHSβ and BHPβ , respectively, stand for “Sequential” and “Parallel”.
We first define BHSβ and show why parallel computing requires a different communication graph
BHPβ to reduce total communication cost.
The graph BHSβ of a set of particles P contains two sets of vertices: P, the particles, and B, the
set of boxes in the hierarchical tree. The edge set of the graph BHSβ is defined by the communication
pattern of the sequential BH. A particle p is connected with a box b if in BH, we need to evaluate
p against b to compute the force or potential exerted on p. So the edge is directed from b to p.
Notice that if p is connected with b, then b must be well-separated from p. Moreover, the parent of
b is not well-separated from p. Therefore, if p is connected with b in BHSβ , then p is not connected
to any box in the subtree of b nor to any ancestor of b.
In addition, each box is connected directly with its parent box in the hierarchical tree, and each
point p is connected to its leaf-box. Both types of edges are bi-directional.
Lemma 10.5.2 Each particle is connected to at most O(log n + µ) boxes, so the in-degree of BHSβ
is bounded by O(log n + µ).
Notice, however, BHSβ is not suitable for parallel implementation. It has a large out-degree.
This major drawback can be illustrated by the example of n uniformly distributed particles in two
dimensions. Assume we have four processors. Then the “best” way to partition the problem is to
divide the root-box into four boxes and map each box onto a processor. Notice that in the direct
parallel implementation of BH, as modeled by BHSβ , each particle needs to access the information
of at least one box in each of the other processors. Because each processor has n/4 particles, the
total communication overhead is Ω(n), which is very expensive.
The main problem with BHSβ is that many particles from a processor need to access the in-
formation of the same box in some other processors (which contributes to the large out-degree).
We show that a combination technique can be used to reduce the out-degree. The idea is to com-
bine the “same” information from a box and send the information as one unit to another box on
a processor that needs the information.
We will show that this combination technique reduces the total communication cost to O(√n log n)
for the four-processor example, and to O(√(pn) log n) for p processors. Similarly, in three dimensions,
the combination technique reduces the volume of messages from Ω(n log n) to O(p^{1/3} n^{2/3} (log n)^{1/3}).
We can define a graph BHPβ to model the communication and computation pattern that uses
this combination technique. Our definition of BHPβ is inspired by the communication pattern of
the refined FMM. It can be shown that the communication pattern of the refined FMM can be
used to guide the message combination for the parallel implementation of the Barnes-Hut method!
The combination technique is based on the following observation: Suppose p is well-separated
from b1 but not from the parent of b1 . Let b be the largest box that contains p such that b is
well-separated from b1 , using the well-separation definition in Section 10.5.2. If b is not a leaf-box,
then (b, b1 ) is a bi-directional interaction-pair in the refined FMM. If b is a leaf-box, then (b, b 1 ) is
a uni-directional interaction-pair from b1 to b. Hence (b, b1 ) is an edge of F M β . Then, any other
particle q contained in b is well-separated from b1 as well. Hence we can combine the information
from b1 to p and q and all other particles in b as follows: b1 sends its information (just one copy) to
b and b forwards the information down the hierarchical tree, to both p and q and all other particles
in b. This combination-based-communication scheme defines a new communication graph BHPβ for
parallel BH: The nodes of the graph are the union of particles and boxes, i.e., P ∪ B(P ). Each
particle is connected to the leaf-box it belongs to. Two boxes are connected iff they are connected
in the Fast-Multipole graph. However, to model the communication cost, we must introduce a
weight on each edge along the hierarchical tree embedded in BHPβ , to be equal to the number of
data units needed to be sent along that edge.
Lemma 10.5.3 The weight on each edge in BHPβ is at most O(log n + µ).
It is worthwhile to point out the difference between the comparison and communication patterns
in BH. In the sequential version of BH, if p is connected with b, then we have to compare p
against all ancestors of b in the computation. The procedure is to first compare p with the root
of the hierarchical tree, and then recursively move the comparison down the tree: if the current
box compared is not well-separated from p, then we will compare p against all its child-boxes.
However, in terms of force and potential calculation, we only evaluate a particle against the first
box down a path that is well-separated from the particle. The graphs BHSβ and BHPβ capture the
communication pattern, rather than the comparison pattern. The communication is more essential
to force or potential calculation. The construction of the communication graph has been one of the
bottlenecks in load balancing BH and FMM on a parallel machine.
We refer to the resulting graph as the near-field graph, denoted by NFβ . We also define the hierarchical out-degree
of a box b to be the number of edges from b to the set of non-leaf-boxes constructed above. We
can show that the hierarchical out-degree is also small.
To model the near-field communication, similar to our approach for BH, we introduce a weight
on the edges of the hierarchical tree.
Definition 10.5.1 Let α ≥ 1 be given, and let {B1 , . . . , Bn } be a k-ply box-system. The α-overlap
graph for this box-system is the undirected graph with vertices V = {1, . . . , n} and edges
    E = { (i, j) : Bi ∩ (α · Bj ) ≠ ∅ and (α · Bi ) ∩ Bj ≠ ∅ }.
The edge condition is equivalent to: (i, j) ∈ E iff the α dilation of the smaller box touches the
larger box.
As shown in [96], the partitioning algorithm and theorem of Miller et al can be extended to
overlap graphs on box-systems.
Theorem 10.5.1 Let G be an α-overlap graph over a k-ply box-system in IRd . Then G can be
partitioned into two equal sized subgraphs by removing at most O(αk^{1/d} n^{1−1/d}) vertices. Moreover,
such a partitioning can be computed in linear time sequentially and in parallel O(n/p) time with p
processors.
Theorem 10.5.2 Let P = {p1 , . . . , pn } be a point set in IRd that is µ-non-uniform. Then the set
of boxes B(P ) of the hierarchical tree of P is a (log_{2^d} n + µ)-ply box-system, and F M β (P ) and BHPβ (P )
are subgraphs of the 3β-overlap graph of B(P ).
Therefore,
Theorem 10.5.3 Let G be an N-body communication graph (either for BH or FMM) of a set of
particles located at P = {p1 , ..., pn } in IRd (d = 2 or 3). If P is µ-non-uniform, then G can
be partitioned into two equal sized subgraphs by removing at most O(n^{1−1/d} (log n + µ)^{1/d}) nodes.
Moreover, such a partitioning can be computed in linear time sequentially and in parallel O(n/p)
time with p processors.
Lecture 11
Mesh Generation
boundary of the domain either in the form of a continuous model or of a discretized boundary
model. Numerical requirements within the domain are typically obtained from an initial numerical
simulation on a preliminary set of points. The numerical requirements obtained from the initial
point set define an additional local spacing function restricting the final point set.
An automatic mesh generator tries to generate additional points in the interior and on the boundary
of the domain to smooth out the mesh generation and to concentrate mesh density where necessary,
so as to optimize the total number of mesh points.
The local spacing requirement can be derived from the Taylor expansion of the solution u,
\[
u(x + dx) = u(x) + dx \, \nabla u(x) + \tfrac{1}{2} \, dx \, H \, dx^{T} + \cdots,
\]
where H is the Hessian matrix of u, the matrix of second partial derivatives. The spacing of mesh
points required by the accuracy of the discretization at a point x is denoted by h(x) and should
depend on the reciprocal of the square root of the largest eigenvalue of H at x.
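As a tiny worked example of this rule, the following MATLAB lines compute the spacing at a point from an assumed, made-up Hessian estimate.

H = [40 5; 5 2];                     % made-up Hessian estimate of u at a point x
h = 1/sqrt(max(eig(H)))              % spacing: reciprocal square root of the largest eigenvalue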
When solving a PDE numerically, we estimate the eigenvalues of Hessian at a certain set of
points in the domain based on the numerical approximation of the previous iteration [4, 95]. We
then expand the spacing requirement induced by Hessian at these points over the entire domain.
For a problem with a smooth change in solution, we can use a (more-or-less) uniform mesh
where all elements are of roughly equal size. On the other hand, for problems with rapid changes in
solution, such as earthquake, wave, or shock modeling, we may use a much denser grid in the areas
of high intensity (see Figure 11.2). So, information about the solution structure can be of great
value to quality mesh generation.
Other types of information may arise in the process of solving a simulation problem. For example,
in adaptive methods, we may start with a coarse, uniform grid. We then estimate the error of
the previous step and, based on the error bound, adaptively refine the mesh, e.g., make the areas
with larger error much denser for the next step of the calculation. As we shall argue later,
unstructured mesh generation is more about finding the proper distribution of mesh points than
about the discretization itself (this is a very personal opinion).
• Unstructured grids decompose the domain into simple mesh elements such as simplices
based on a density function that is defined by the input geometry or the numerical require-
ments (e.g., from error estimation). But the associated matrices are harder and slower to
assemble compared to the previous method; the resulting linear systems are also relatively
hard to solve. Most finite element meshes used in practice are of the unstructured type.
• Hybrid grids are generated by first decomposing the domain into non-regular subdomains and
then discretizing each subdomain with a regular grid. Hybrid grids are often used in domain
decomposition.
Structured grids are much easier to generate and manipulate, and the numerical theory of this
discretization is better understood. However, their applicability is limited to problems with simple
domains and smooth changes in solution. For problems with complex geometry whose solution
changes rapidly, we need to use an unstructured mesh to reduce the problem size. For example,
when modeling an earthquake we want a dense discretization near the quake center and a sparse
discretization in the regions with low activity. It would be wasteful to give regions with low activity
as fine a discretization as the regions with high activity. Unstructured meshes are especially
important for three dimensional problems.
The adaptability of unstructured meshes comes with new challenges, especially for 3D problems.
However, the numerical theory becomes more difficult – this is an outstanding direction for future
research; the algorithmic design becomes much harder.
With adaptive hierarchical trees, we can “optimally” approximate any geometric and numerical
spacing function. The proof of the optimality can be found in the papers of Bern, Eppstein, and
Gilbert for 2D and Mitchell and Vavasis for 3D. A formal discussion of the numerical and geometric
spacing functions can be found in the point-generation paper of Miller, Talmor and Teng.
The following procedure describes the basic steps of hierarchical refinement.
1. Construct the hierarchical tree for the domain so that the leaf boxes approximate the numer-
ical and geometric spacing functions.
3. Warping and triangulation: If a point is too close to a boundary of its leaf box then one of
the corners collapses to that point.
exist for Delaunay triangulations. Chew [17] and Ruppert [84] have developed Delaunay refinement
algorithms that generate provably good meshes for 2D domains.
Notice that an internal diagonal belongs to the Delaunay triangulation of four points if the sum
of the two opposing angles is less than π.
A 2D Delaunay Triangulation can be found by the following simple algorithm: FLIP algorithm
• Find any triangulation (can be done in O(n lg n) time using divide and conquer.)
• For each edge pq, let the two faces containing the edge be prq and psq. Then pq is not a local
Delaunay edge if the interior of the circumscribed circle of prq contains s. Interestingly, this
condition also means that the interior of the circumscribed circle of psq contains r and that the
sum of the angles prq and psq is greater than π. We call the condition that the sum of the
angles prq and psq is no more than π the angle condition. Then, if pq does not satisfy the
angle condition, we just flip it: remove edge pq from T, and put in edge rs. Repeat this until
all edges satisfy the angle condition.
It is not too hard to show that if FLIP terminates, it will output a Delaunay triangulation.
A little additional geometric effort shows that the FLIP procedure above, fortunately, always
terminates after at most O(n2 ) flips.
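The local test at the heart of the FLIP algorithm is a one-line angle computation; the MATLAB sketch below uses four made-up points p, q, r, s, with pq the shared edge of triangles prq and psq.

p = [0 0];  q = [1 0];  r = [0.5 0.8];  s = [0.5 -0.2];        % example points
ang = @(a, b, c) acos(dot(a-b, c-b) / (norm(a-b)*norm(c-b)));  % angle at vertex b
if ang(p, r, q) + ang(p, s, q) > pi
    fprintf('flip: replace edge pq by edge rs\n');             % angle condition violated
else
    fprintf('edge pq is locally Delaunay\n');
end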
The following is an interesting observation of Guibas, Knuth and Sharir. Choose a random
permutation π of {1, ..., n} and permute the points accordingly: pπ(1) . . . pπ(n) . We
then incrementally insert the points into the current triangulation and perform flips if needed.
Notice that the initial triangulation is the triangle formed by the first three points. It can be shown
that the expected number of flips of the above algorithm is O(n log n). This gives a randomized
O(n log n) time DT algorithm.
• If the circumcenter encroaches upon an edge of an input segment, split that edge by adding its
midpoint; otherwise add the circumcenter.
A point encroaches on an edge if the point is contained in the interior of the circle of which the
edge is a diameter. We can now define two operations, Split-Triangle and Split-Segment
Split-Triangle(T): add the circumcenter of T.
Split-Segment(S): add the midpoint m of S.
Initialize with a Delaunay triangulation of the input and repeatedly apply these two operations
until every triangle satisfies the angle bound.
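The encroachment test itself is a simple geometric predicate: p lies strictly inside the circle with diameter ab exactly when the angle apb is obtuse. A MATLAB sketch with made-up points:

a = [0 0];  b = [2 0];  p = [1 0.6];          % example segment ab and point p
encroaches = dot(a - p, b - p) < 0;           % true iff the angle apb is obtuse
fprintf('p encroaches segment ab: %d\n', encroaches);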
The following theorem was then stated without proof; the proof was first given by Jim Ruppert in
his Ph.D. thesis at UC Berkeley.
Theorem 11.3.1 Not only does the Delaunay refinement produce all triangles with minimum angle
α > 25◦ , but the size of the mesh it produces is no more than C ∗ Optimal(size).
• Laplacian Smoothing: move every node towards the center of mass of its neighbors.
• Refinement: given a mesh, we want smaller elements in a certain region. One option is to put
an inscribed triangle inside each triangle being refined (by connecting the midpoints of its
edges). Another approach is to split the longest edge of the triangles. However, you have to
be careful about hanging nodes while doing this.
Binary Classification h(x) = y ∈ {±1} (The learner classifies objects into one of two groups.)
Discrete Classification h(x) = y ∈ {1, 2, 3, ..., n} (The learner classifies objects into one of n
groups.)
Continuous Classification h(x) = y ∈ IRn (The learner classifies objects in the space of n-dimensional
real vectors.)
As an example, to learn a linear regression, x could be a collection of data points. In this case, the
y = h(x) that is learned is the best fit line through the data points.
All learning methods must also be aware of overspecifying the learned function. Given a set of
training examples, it is often inappropriate to learn a function that fits these examples too well, as
the examples often include noise. An overspecified function will often do a poor job of classifying
new data.
Figure 12.2: An example of a set of objects which are not linearly separable.
This method works well for training sets which are linearly separable. However, there are also
many cases where the training set is not linearly separable. An example is shown in Figure 12.2.
In this case, it is impossible to find a separating hyperplane between the apples and oranges.
There are several methods for dealing with this case. One method is to add slack variables εi
to the program:
\[
\min_{w,b,\varepsilon} \; \frac{\|w\|^2}{2} + c \sum_i \varepsilon_i
\quad \text{subject to} \quad
y_i (w^T x_i + b) \ge 1 - \varepsilon_i, \quad i = 1, ..., n
\]
Slack variables allow some of the xi to “move” in space to “slip” onto the correct side of the
separating hyperplane. For instance, in Figure 12.2, the apples on the left side of the figure could
have associated εi which allow them to move to the right and into the correct category.
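The slack-variable program above is a convex quadratic program and can be handed directly to a QP solver. The MATLAB sketch below uses quadprog (which requires the Optimization Toolbox) on made-up two-dimensional data; the constraints εi ≥ 0, implicit in the formulation, are imposed through the lower bounds.

rng(0);
X = [randn(20,2) + 2; randn(20,2) - 2];      % made-up, roughly separable data
y = [ones(20,1); -ones(20,1)];
[n, d] = size(X);  c = 1;
% unknowns z = [w; b; eps], of dimension d + 1 + n
H = blkdiag(eye(d), 0, zeros(n));            % quadratic term: (1/2)*||w||^2
f = [zeros(d+1,1); c*ones(n,1)];             % linear term: c*sum(eps_i)
A = [-(y*ones(1,d)).*X, -y, -eye(n)];        % encodes y_i*(w'*x_i + b) >= 1 - eps_i
bvec = -ones(n,1);
lb = [-inf(d+1,1); zeros(n,1)];              % eps_i >= 0
z = quadprog(H, f, A, bvec, [], [], lb, []);
w = z(1:d);  b = z(d+1);
fprintf('training accuracy: %.2f\n', mean(sign(X*w + b) == y));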
Another approach is to non-linearly distort space around the training set using a function Φ(x i ):
\[
\min_{w,b} \; \frac{\|w\|^2}{2}
\quad \text{subject to} \quad
y_i (w^T \Phi(x_i) + b) \ge 1, \quad i = 1, ..., n
\]
In many cases, this distortion moves the objects into a configuration that is more easily separated
by a hyperplane. As mentioned above, one must be careful not to overspecify Φ(x i ), as it could
create a function that is unable to cope easily with new data.
Another way to approach this problem is through the dual of the optimization problem shown in
Equations (12.1) and (12.2) above. If we consider those equations to be the primal, the dual is:
\[
\max_{\alpha} \; \alpha^T \mathbf{1} - \frac{1}{2} \alpha^T H \alpha
\quad \text{subject to} \quad
y^T \alpha = 0, \quad \alpha \ge 0
\]
Note that we have introduced Lagrange Multipliers αi for the dual problem. At optimality, we have
\[
w = \sum_i y_i \alpha_i x_i
\]
Figure 12.3: An example of classifying handwritten numerals. The graphs on the right show the
probabilities of each sample being classified as each number, 0-9.
This dual problem also applies to the slack variable version using the constraints:
yT α = 0
c≥α≥0
as well as the distorted space version:
Hij = yi (Φ(xi )T Φ(xj ))yj
12.1.3 Applications
The type of classification provided by Support Vector Machines is useful in many applications. For
example, the post office must sort hundreds of thousands of hand-written envelopes every day. To
aid in this process, they make extensive use of handwriting recognition software which uses SVMs
to automatically decipher handwritten numerals. An example of this classification is shown in
Figure 12.3.
Military uses for SVMs also abound. The ability to quickly and accurately classify objects in a
noisy visual field is essential to many military operations. For instance, SVMs have been used to
identify humans or artillery against the backdrop of a crowded forest of trees.
The singular value decomposition (SVD) of a matrix A is the factorization A = UΣVT .
Both U and V are orthogonal matrices, that is, UT U = I and VT V = I. Σ is the matrix of singular
values; it is zero except on the diagonal, whose entries are labeled σi :
\[
\Sigma = \begin{pmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_n \end{pmatrix}
\]
There are several interesting facts associated with the SVD of a matrix. First, the SVD is
well-defined for any matrix A of size m × n, even for m ≠ n. In physical space, if a matrix A is
applied to a unit hypercircle in n dimensions, it deforms it into a hyperellipse. The semi-axes of
the new hyperellipse have lengths equal to the singular values σi . An example is shown in Figure 12.4.
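This is easy to see in two dimensions; in the MATLAB sketch below the matrix A is an arbitrary example, and the plotted ellipse has semi-axes of lengths svd(A).

A = [2 1; 0 1];                          % an arbitrary 2-by-2 matrix
t = linspace(0, 2*pi, 200);
circle  = [cos(t); sin(t)];              % the unit circle
ellipse = A * circle;                    % its image under A
plot(ellipse(1,:), ellipse(2,:));  axis equal;
disp(svd(A)')                            % the semi-axis lengths of the ellipse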
The singular values σi also have a close relation to the eigenvalues λi . Some of these relations are:
• σi (A) = √λi (AT A) = √λi (AAT );
• if A = AT , then σi = |λi |;
• the eigenvalues of the symmetric matrix [ 0 AT ; A 0 ] are ±σi .
These relationships often make it much more useful as well as more efficient to utilize the singular
value decomposition of a matrix rather than computing AT A, which is an intensive operation.
The SVD may also be used to approximate a matrix A with n singular values:
\[
A = \sum_{i=1}^{n} \sigma_i u_i v_i^T \;\approx\; \sum_{i=1}^{p} \sigma_i u_i v_i^T, \qquad p < n,
\]
where ui is the ith column of U and vi is the ith column of V. This is also known as the “rank-p
approximation of A in the 2-norm or F-norm.”
This approximation has an interesting application for image compression. By taking an image
as a matrix of pixel values, we may find its SVD. The rank-p approximation of the image is a
compression of the image. For example, James Demmel approximated an image of his daughter for
the cover of his book Applied Numerical Linear Algebra, shown in Figure 12.5. Note that successive
approximations create horizontal and vertical “streaks” in the image.
The following MATLAB code will load an image of a clown and display its rank-p approximation:
>> load clown;
>> image(X);
>> colormap(map);
>> [U,S,V] = svd(X);
>> p=1; image(U(:,1:p)*S(1:p,1:p)*V(:,1:p)’);
The SVD may also be used to perform latent semantic indexing, or clustering of documents
based on the words they contain. We build a matrix A which indexes the documents along one
axis and the words along the other. Aij = 1 if word j appears in document i, and 0 otherwise. By
taking the SVD of A, we can use the singular vectors to represent the “best” subset of documents
for each cluster.
Finally, the SVD has an interesting application when using the FFT matrix for parallel com-
putations. Taking the SVD of one-half of the FFT matrix results in singular values that are
approximately one-half zeros. Similarly, taking the SVD of one-quarter of the FFT matrix results
in singular values that are approximately one-quarter zeros. One can see this phenomenon with
the following MATLAB code:
>> f = fft(eye(100));
>> g = f(1:50,51:100);
>> plot(svd(g),’*’);
These near-zero values provide an opportunity for compression when communicating parts of
the FFT matrix across processors [30].
Bibliography
[1] N. Alon, P. Seymour, and R. Thomas. A separator theorem for non-planar graphs. In
Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, Maryland, May
1990. ACM.
[2] C. R. Anderson. An implementation of the fast multipole method without multipoles. SIAM
J. Sci. Stat. Comp., 13(4):932–947, July 1992.
[3] A. W. Appel. An efficient program for many-body simulation. SIAM J. Sci. Stat. Comput.,
6(1):85–103, 1985.
[4] I. Babuška and A.K. Aziz. On the angle condition in the finite element method. SIAM J.
Numer. Anal., 13(2):214–226, 1976.
[5] J. Barnes and P. Hut. A hierarchical O(n log n) force calculation algorithm. Nature, 324
(1986), pp. 446–449.
[6] M. Bern, D. Eppstein, and J. R. Gilbert. Provably good mesh generation. J. Comp. Sys. Sci.
48 (1994) 384–409.
[7] M. Bern and D. Eppstein. Mesh generation and optimal triangulation. In Computing in
Euclidean Geometry, D.-Z. Du and F.K. Hwang, eds. World Scientific (1992) 23–90.
[8] M. Bern, D. Eppstein, and S.-H. Teng. Parallel construction of quadtrees and quality tri-
angulations. In Workshop on Algorithms and Data Structures, Springer LNCS 709, pages
188–199, 1993.
[9] G. Birkhoff and A. George. Elimination by nested dissection. Complexity of Sequential and
Parallel Numerical Algorithms, J. F. Traub, Academic Press, 1973.
[10] P. E. Bjørstad and O. B. Widlund. Iterative methods for the solution of elliptic problems on
regions partitioned into substructures. SIAM J. Numer. Anal., 23:1097-1120, 1986.
[11] G. E. Blelloch. Vector Models for Data-Parallel Computing. MIT-Press, Cambridge MA,
1990.
[14] C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and
Knowledge Discovery, 2(2):121–167, 1998.
[16] T. F. Chan and D. C. Resasco. A framework for the analysis and construction of domain
decomposition preconditioners. UCLA-CAM-87-09, 1987.
[18] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. North–Holland, 1978.
[19] K. Clarkson, D. Eppstein, G. L. Miller, C. Sturtivant, and S.-H. Teng. Approximating center
points with and without linear programming. In Proceedings of 9th ACM Symposium on
Computational Geometry, pages 91–98, 1993.
[20] T. Coe, Inside the Pentium FDIV bug, Dr. Dobb’s Journal 20 (April, 1995), pp 129–135.
[21] T. Coe, T. Mathisen, C. Moler, and V. Pratt, Computational aspects of the Pentium affair,
IEEE Computational Science and Engineering 2 (Spring 1995), pp 18–31.
[22] T. Coe and P. T. P. Tang, It takes six ones to reach a flaw, preprint.
[23] J. Conroy, S. Kratzer, and R. Lucas, Data parallel sparse LU factorization, in Parallel
Processing for Scientific Computing, SIAM, Philadelphia, 1994.
[25] L. Danzer, J. Fonlupt, and V. Klee. Helly’s theorem and its relatives. Proceedings of Symposia
in Pure Mathematics, American Mathematical Society, 7:101–180, 1963.
[26] J. Dongarra, R van de Geijn, and D. Walker, A look at scalable dense linear algebra libraries,
in Scalable High Performance Computer Conference, Williamsburg, VA, 1992.
[27] I. S. Duff, R. G. Grimes, and J. G. Lewis, Sparse matrix test problems, ACM TOMS, 15
(1989), pp. 1-14.
[28] A. L. Dulmage and N. S. Mendelsohn. Coverings of bipartite graphs. Canadian J. Math. 10,
pp 517-534, 1958.
[30] A. Edelman, P. McCorquodale, and S. Toledo. The future fast fourier transform. SIAM
Journal on Scientific Computing, 20(3):1094–1114, 1999.
[32] D. Eppstein, G. L. Miller, and S.-H. Teng. A deterministic linear time algorithm for geometric
separators and its applications. In Proceedings of 9th ACM Symposium on Computational
Geometry, pages 99–108, 1993.
[33] C. Farhat and M. Lesoinne. Automatic partitioning of unstructured meshes for the parallel
solution of problems in computational mechanics. Int. J. Num. Meth. Eng. 36:745-764 (1993).
[35] I. Fried. Condition of finite element matrices generated from nonuniform meshes. AIAA J.
10, pp 219–221, 1972.
[36] M. Garey and M. Johnson, Computers and Intractability: A Guide to the Theory of NP-
Completeness, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[37] J. A. George. Nested dissection of a regular finite element mesh. SIAM J. Numerical Analysis,
10: 345–363, 1973.
[38] J. A. George and J. W. H. Liu. Computer Solution of Large Sparse Positive Definite Systems.
Prentice-Hall, 1981.
[39] A. George, J. W. H. Liu, and E. Ng, Communication results for parallel sparse Cholesky
factorization on a hypercube, Parallel Comput. 10 (1989), pp. 287–298.
[41] J. R. Gilbert, G. L. Miller, and S.-H. Teng. Geometric mesh partitioning: Implementation
and experiments. In SIAM J. Sci. Comp., to appear 1995.
[42] G. Golub and W. Kahan. Calculating the singular values and pseudoinverse of a matrix.
SIAM Journal on Numerical Analysis, 2:205–224, 1965.
[43] G. H. Golub and C. F. Van Loan. Matrix Computations, 2nd Edition. Johns Hopkins
University Press, 1989.
[44] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comp. Phys. 73
(1987) pp325-348.
[46] R. W. Hackney and J. W. Eastwood. Computer Simulation Using Particles. McGraw Hill,
1981.
[47] G. Hardy, J. E. Littlewood and G. Pólya. Inequalities. Second edition, Cambridge University
Press, 1952.
[48] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete and Computational
Geometry, 2: 127–151, 1987.
[49] N.J. Higham. The accuracy of floating point summation. SIAM J. Scient. Comput.,
14:783–799, 1993.
[51] T. Joachims. Text categorization with support vector machines: learning with many relevant
features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of ECML-98, 10th
European Conference on Machine Learning, number 1398, pages 137–142, Chemnitz, DE,
1998. Springer Verlag, Heidelberg, DE.
[52] M. T. Jones and P. E. Plassman. Parallel algorithms for the adaptive refinement and par-
titioning of unstructured meshes. Proc. Scalable High-Performance Computing Conf. (1994)
478–485.
[54] F. T. Leighton. Complexity Issues in VLSI. Foundations of Computing. MIT Press, Cam-
bridge, MA, 1983.
[55] F. T. Leighton and S. Rao. An approximate max-flow min-cut theorem for uniform multi-
commodity flow problems with applications to approximation algorithms. In 29th Annual
Symposium on Foundations of Computer Science, pp 422-431, 1988.
[56] C. E. Leiserson. Area Efficient VLSI Computation. Foundations of Computing. MIT Press,
Cambridge, MA, 1983.
[57] C. E. Leiserson and J. G. Lewis. Orderings for parallel sparse symmetric factorization. in 3rd
SIAM Conference on Parallel Processing for Scientific Computing, 1987.
[58] G. Y. Li and T. F. Coleman. A parallel triangular solver for a distributed memory multiprocessor.
SIAM J. Scient. Stat. Comput. 9 (1988), pp. 485–502.
[60] R. J. Lipton and R. E. Tarjan. A separator theorem for planar graphs. SIAM J. of Appl.
Math., 36:177–189, April 1979.
[61] J. W. H. Liu. The solution of mesh equations on a parallel computer. in 2nd Langley
Conference on Scientific Computing, 1974.
[62] P.-F. Liu. The parallel implementation of N-body algorithms. PhD thesis, Yale University,
1994.
[63] R. Lohner, J. Camberos, and M. Merriam. Parallel unstructured grid generation. Computer
Methods in Applied Mechanics and Engineering 95 (1992) 343–357.
[64] J. Makino and M. Taiji, T. Ebisuzaki, and D. Sugimoto. Grape-4: a special-purpose computer
for gravitational N-body problems. In Parallel Processing for Scientific Computing, pages
355–360. SIAM, 1995.
[66] G. L. Miller. Finding small simple cycle separators for 2-connected planar graphs. Journal
of Computer and System Sciences, 32(3):265–279, June 1986.
[67] G. L. Miller, S.-H. Teng, W. Thurston, and S. A. Vavasis. Automatic mesh partitioning.
In A. George, J. Gilbert, and J. Liu, editors, Sparse Matrix Computations: Graph Theory
Issues and Algorithms, IMA Volumes in Mathematics and its Applications. Springer-Verlag,
pp57–84, 1993.
[68] G. L. Miller, S.-H. Teng, W. Thurston, and S. A. Vavasis. Finite element meshes and geometric
separators. SIAM J. Scientific Computing, to appear, 1995.
[69] G. L. Miller, D. Talmor, S.-H. Teng, and N. Walkington. A Delaunay Based Numerical
Method for Three Dimensions: generation, formulation, partition. In the proceedings of the
twenty-sixth annual ACM symposium on the theory of computing, to appear, 1995.
[70] S. A. Mitchell and S. A. Vavasis. Quality mesh generation in three dimensions. Proc. 8th
ACM Symp. Comput. Geom. (1992) 212–221.
[71] K. Nabors and J. White. A multipole accelerated 3-D capacitance extraction program. IEEE
Trans. Comp. Des. 10 (1991) v11.
[72] D. P. O’Leary and G. W. Stewart, Data-flow algorithms for parallel matrix computations,
CACM, 28 (1985), pp. 840–853.
[73] L.S. Ostrouchov, M.T. Heath, and C.H. Romine, Modeling speedup in parallel sparse matrix
factorization, Tech Report ORNL/TM-11786, Mathematical Sciences Section, Oak Ridge
National Lab., December, 1990.
[74] V. Pan and J. Reif. Efficient parallel solution of linear systems. In Proceedings of the 17th
Annual ACM Symposium on Theory of Computing, pages 143–152, Providence, RI, May 1985.
ACM.
[75] A. Pothen, H. D. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl. 11 (3), pp. 430–452, July 1990.
[77] V. Pratt, Anatomy of the Pentium Bug, TAPSOFT’95, LNCS 915, Springer-Verlag, Aarhus,
Denmark, (1995), 97–107.
[79] A. A. G. Requicha. Representations of rigid solids: theory, methods, and systems. ACM Computing Surveys, 12:437–464, 1980.
[80] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organiza-
tion in the brain. Psychological Review, 65:386–408, 1958.
[81] E. Rothberg and A. Gupta, The performance impact of data reuse in parallel dense Cholesky
factorization, Stanford Comp. Sci. Dept. Report STAN-CS-92-1401.
[82] E. Rothberg and A. Gupta, An efficient block-oriented approach to parallel sparse Cholesky factorization, Supercomputing '93, pp. 503–512, November, 1993.
[83] E. Rothberg and R. Schreiber, Improved load distribution in parallel sparse Cholesky factor-
ization, Supercomputing ’94, November, 1994.
[84] J. Ruppert. A new and simple algorithm for quality 2-dimensional mesh generation. Proc.
4th ACM-SIAM Symp. Discrete Algorithms (1993) 83–92.
[85] Y. Saad and M.H. Schultz, Data communication in parallel architectures, Parallel Comput.
11 (1989), pp. 131–150.
[86] J. K. Salmon. Parallel Hierarchical N-body Methods. PhD thesis, California Institute of
Technology, 1990. CRPR-90-14.
[87] J. K. Salmon, M. S. Warren, and G. S. Winckelmans. Fast parallel tree codes for gravitational and fluid dynamical N-body problems. Int. J. Supercomputer Applications, 8(2):129–142, 1994.
[88] H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys,
pages 188–260, 1984.
[89] K. E. Schmidt and M. A. Lee. Implementing the fast multipole method in three dimensions. J. Stat. Phys., 63, 1991.
[90] H. P. Sharangpani and M. L. Barton. Statistical analysis of floating point flaw in the Pentium processor. Technical Report, Intel Corporation, November 1994.
[92] H. D. Simon and S.-H. Teng. How good is recursive bisection? SIAM J. Scientific Computing,
to appear, 1995.
[93] J. P. Singh, C. Holt, T. Totsuka, A. Gupta, and J. L. Hennessy. Load balancing and data locality in hierarchical N-body methods. Technical Report CSL-TR-92-505, Stanford, 1992.
[94] G. W. Stewart. On the early history of the singular value decomposition. Technical Report
CS-TR-2855, 1992.
[95] G. Strang and G. J. Fix. An Analysis of the Finite Element Method. Prentice-Hall, Englewood
Cliffs, New Jersey, 1973.
[96] S.-H. Teng. Points, Spheres, and Separators: a unified geometric approach to graph parti-
tioning. PhD thesis, Carnegie-Mellon University, School of Computer Science, 1991. CMU-
CS-91-184.
[97] V. Vapnik. Estimation of dependencies based on empirical data [in Russian]. 1979.
[98] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[100] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264–280, 1971.
[101] R. D. Williams. Performance of dynamic load balancing algorithms for unstructured mesh
calculations. Concurrency, 3 (1991) 457
[102] F. Zhao. An O(n) algorithm for three-dimensional n-body simulation. Technical Report AI Memo 995, MIT AI Lab., October 1987.
[103] F. Zhao and S. L. Johnsson. The parallel multipole method on the Connection Machine. SIAM J. Sci. Stat. Comput., 12:1420–1437, 1991.