Message Passing - 1
Outline
• MPI basics,
• Point-to-point communication,
• Collective communication,
• Synchronous/asynchronous send/receive,
• Algorithms for
– gather,
– scatter,
– broadcast, and
– reduce.
Message Passing Interface (MPI) Basics
Message Passing Interface (MPI)
• MPI is a specification for the developers and users of message passing libraries.
• MPI primarily addresses the message-passing parallel programming model
- data is moved from the address space of one process to that of another process through
cooperative operations on each process.
Message Passing Interface (MPI) Commands
• Send a message.
• Receive a message.
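As a quick reference, these are the core MPI calls in the C binding, covering initialization, size/rank queries, point-to-point send/receive, and finalization:

int MPI_Init(int *argc, char ***argv);           /* initialize the MPI environment        */
int MPI_Comm_size(MPI_Comm comm, int *size);     /* number of processes in a communicator */
int MPI_Comm_rank(MPI_Comm comm, int *rank);     /* rank of the caller within comm        */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);  /* send a message                        */
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status);                /* receive a message                     */
int MPI_Finalize(void);                          /* shut down the MPI environment         */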
MPI_COMM_WORLD, size and ranks
• When a program is run with MPI, all the processes are grouped in what we call a communicator.
• Think of a communicator as a box grouping processes together, allowing them to communicate.
• Every communication is linked to a communicator, allowing the communication to reach different processes.
• Communications can be of two types:
• Point-to-Point: two processes in the same communicator communicate.
• Collective: all the processes in a communicator communicate together.
• The default communicator is called MPI_COMM_WORLD. It groups all the processes that exist when the program starts.
MPI_COMM_WORLD, size and ranks
• MPI_COMM_WORLD is not the only communicator in MPI; for now, we simply use MPI_COMM_WORLD wherever a communicator is required.
• The number of processes in a communicator does not change once the communicator is created. That number is called the size of the communicator.
• At the same time, each process inside a communicator has a unique number to identify it.
• This number is called the rank of the process.
• The rank of a process always ranges from 0 to size - 1.
Hello World Program
The Open MPI Project is an open-source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners.
MPICH is a high-performance and widely portable implementation of the Message Passing Interface (MPI) standard.
Example
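A minimal sketch of such a program, in which every process prints its rank and the size of MPI_COMM_WORLD:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    // Initialize the MPI execution environment
    MPI_Init(&argc, &argv);

    int size, rank;
    // Number of processes in the default communicator
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // Rank of this process within MPI_COMM_WORLD (0 .. size-1)
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Hello World from process %d of %d\n", rank, size);

    // Shut down the MPI environment
    MPI_Finalize();
    return 0;
}

It can be compiled with mpicc and launched with, for example, mpirun -n 4 ./hello.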
Message Passing
[Figure: message passing between two processes. Process 0 places data from its memory with Put(data); Process 1 retrieves it into its memory with Get(data).]
Point-to-point communication
• MPI provides a set of send and receive functions that allow the
communication of typed data with an associated message tag.
• The type information is needed so that correct data representation
conversions can be performed as data is sent from one architecture to
another.
• The tag allows selectivity of messages at the receiving end.
• One can receive on a particular tag, or one can wild-card this quantity,
allowing reception of messages with any tag.
• Message selectivity on the source process of the message is also provided.
Point-to-point communication
• Messages that have been sent but not yet received are called pending messages.
• It is an important feature of MPI that pending messages are not maintained
in a simple FIFO queue.
• Instead, each pending message has several attributes and the destination
process (the receiving process) can use the attributes to determine which
message to receive.
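As a brief illustration of tag-based selection among pending messages, the following sketch (assuming at least two processes; the tags 7 and 8 and the data values are arbitrary) has rank 1 post two sends with different tags, while rank 0 first receives the tag-8 message and then takes whatever remains using wild-cards:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        int a = 10, b = 20;
        MPI_Request reqs[2];
        // Post two non-blocking sends with different tags
        MPI_Isend(&a, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&b, 1, MPI_INT, 0, 8, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    } else if (rank == 0) {
        int x, y;
        MPI_Status status;
        // Select by tag: receive the tag-8 message even though the tag-7
        // message may have been sent first (pending messages are not a FIFO)
        MPI_Recv(&y, 1, MPI_INT, 1, 8, MPI_COMM_WORLD, &status);
        // Wild-card source and tag: accept any remaining pending message
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        printf("rank 0 got %d (tag 8) then %d (tag %d)\n", y, x, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}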
Parallel Programming Example (with Point-to-Point Communication)
Bridge Construction
• A bridge is to be assembled from girders being constructed at a foundry. These two activities
are organized by providing trucks to transport girders from the foundry to the bridge site.
• This situation is illustrated in the figure overleaf with the foundry and bridge represented as
tasks and the stream of trucks as a channel. Notice that this approach allows assembly of the
bridge and construction of girders to proceed in parallel without any explicit coordination.
• The foundry crew puts girders on trucks as they are produced, and the assembly crew adds
girders to the bridge as and when they arrive.
Parallel Programming Example (with Point-to-Point Communication)
Bridge Construction
• The first solution uses a single channel on which girders generated by the foundry are transported as fast as they are generated.
– If the foundry generates girders faster than they are consumed by the bridge, then girders accumulate at the construction site.
• The second solution uses a second channel to pass flow-control messages from the bridge to the foundry so as to avoid overflow.
[Figure: the foundry and bridge represented as tasks, connected by a channel of trucks; in the second solution a flow-control channel runs from the bridge back to the foundry.]
procedure foundry(numgirders)
begin
for i = 1 to numgirders
/* Make a girder and send it */
/* MPI_SEND(buf, count, datatype, dest, tag, comm) */
MPI_SEND(i, 1, MPI_INT, 1, 0, MPI_COMM_WORLD)
endfor
i = -1 /* Send shutdown message */
MPI_SEND(i, 1, MPI_INT, 1, 0, MPI_COMM_WORLD)
end
procedure bridge
begin
/* Wait for girders and add them to the bridge */
/* MPI_RECV(buf, count, datatype, source, tag, comm, status) */
MPI_RECV(msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, status)
while msg != -1 do
use_girder(msg)
MPI_RECV(msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, status)
endwhile
end
Processes can use point-to-point communication operations to send a message from one named process to another.
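The procedures above are pseudocode; a minimal runnable C sketch of the same producer/consumer pattern (assuming exactly two processes and a hypothetical NUM_GIRDERS constant) could look like this:

#include <stdio.h>
#include <mpi.h>

#define NUM_GIRDERS 10   /* hypothetical number of girders */

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                        /* foundry */
        for (int i = 1; i <= NUM_GIRDERS; i++) {
            // Make a girder and send it to the bridge (rank 1)
            MPI_Send(&i, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }
        int stop = -1;                      /* shutdown message */
        MPI_Send(&stop, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                 /* bridge */
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        while (msg != -1) {
            printf("bridge: using girder %d\n", msg);   /* use_girder(msg) */
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}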
MPI Blocking and Non-blocking
Blocking: Returns after local actions have completed, though the message transfer may not have completed.
• i.e., it returns only when the buffer is ready to be reused.
• Collective communications in MPI are always blocking.
Non-Blocking: Returns immediately, even if the operation has not yet completed.
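As a brief illustration, a non-blocking exchange can be posted with MPI_Isend/MPI_Irecv and completed later with MPI_Waitall; the sketch below assumes exactly two processes:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int sendval = rank, recvval = -1;
    int partner = (rank == 0) ? 1 : 0;      /* assumes exactly 2 processes */
    MPI_Request reqs[2];

    // Post the receive and the send; both calls return immediately
    MPI_Irecv(&recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    // ... useful computation could overlap with communication here ...

    // Block until both operations complete; the buffers are then safe to reuse
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}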
Synchronous/asynchronous Send/Receive
• In the case of asynchronous send/receive, there is no synchronization between the sending and receiving processes.
• Because there is no synchronization, messages can be in a pending state.
Collective communication
MPI_BARRIER(comm)
IN comm    communicator (handle)
MPI_BARRIER blocks the caller until all processes in the communicator comm have entered the call.
Advantages of collective communication over point-to-point
• The possibility of error is significantly reduced: one call to a collective routine replaces the several calls needed with point-to-point operations.
• The source code is much more readable, thus simplifying code debugging and maintenance.
• Optimized forms of the collective routines are often faster than the equivalent operation expressed in terms of point-to-point routines.
Broadcast operation
The simplest kind of collective operation is the broadcast. In a broadcast operation a single process
sends a copy of some data to all the other processes in a group.
MPI_BCAST broadcasts a message from the process with rank root to all processes of the group, itself included. It is called by all members of the group using the same arguments for comm and root. On return, the contents of root's communication buffer have been copied to all processes.
Broadcast operation
MPI_BCAST(inbuf, incnt, intype, root, comm)
INOUT inbuf address of input buffer at root, or output buffer elsewhere (choice)
IN incnt number of elements in input buffer (integer)
IN intype datatype of input buffer elements (handle)
IN root process id of root process (integer)
IN comm communicator (handle)
• This function implements a one-to-all broadcast where a single named root process
sends the same data to all other processes.
• At the time of the call, the data is located in inbuf in process root and consists of incnt
items of type intype.
• After the call the data is replicated in inbuf in all processes.
• As inbuf is used for input at the root and for output in other processes, it has type INOUT.
Broadcast Operation
Syntax:
MPI_Bcast(send_buffer, send_count, send_type, rank, comm)
Example:
send_count = 1;
root = 0;
MPI_Bcast(&a, send_count, MPI_INT, root, comm);
Broadcast Operation
The MPI_BCAST routine enables you to copy data from the memory of the root processor to
the same memory locations for other processors in the communicator.
In this example, one data value in processor 0 is broadcast to the same memory locations
in the other 3 processors. Clearly, you could send data to each processor with multiple calls
to one of the send routines. The broadcast routine makes this data motion a bit easier.
Broadcast operation: Example
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void my_bcast(void* data, int count, MPI_Datatype datatype, int root,
              MPI_Comm communicator) {
  int world_rank;
  MPI_Comm_rank(communicator, &world_rank);
  int world_size;
  MPI_Comm_size(communicator, &world_size);

  if (world_rank == root) {
    // If we are the root process, send our data to everyone
    int i;
    for (i = 0; i < world_size; i++) {
      if (i != world_rank) {
        MPI_Send(data, count, datatype, i, 0, communicator);
      }
    }
  } else {
    // If we are a receiver process, receive the data from the root
    MPI_Recv(data, count, datatype, root, 0, communicator, MPI_STATUS_IGNORE);
  }
}

int main(int argc, char** argv) {
  MPI_Init(NULL, NULL);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  int data;
  if (world_rank == 0) {
    data = 100;
    printf("Process 0 broadcasting data %d\n", data);
    my_bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
  } else {
    my_bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Process %d received data %d from root process\n",
           world_rank, data);
  }

  MPI_Finalize();
}
mpirun -n 4 ./my_bcast
Process 0 broadcasting data 100
Process 2 received data 100 from root process
Process 3 received data 100 from root process
Process 1 received data 100 from root process
Scatter and Gather operation
Another important class of collective operations consists of those that distribute data from one processor to a group of processors, or vice versa. These are called scatter and gather operations.
• In a scatter operation, all of the data (an array of some type) are initially collected on a single
processor (the left side of the figure).
• After the scatter operation, pieces of the data are distributed on different processors (the right side
of the figure).
• The gather operation is the inverse of scatter: it collects pieces of the data that are distributed across a group of processors and reassembles them in the proper order on a single processor.
Scatter and Gather operation
Syntax:
MPI_Gather(send_buffer, send_count, send_type, recv_buffer, recv_count, recv_type, recv_rank, comm)
Example:
send_count = 1;
recv_count = 1;
recv_rank = 0;
MPI_Gather(&a, send_count, MPI_REAL, &a, recv_count, MPI_REAL, recv_rank, MPI_COMM_WORLD);
Gather operation
Here, data values A on each processor are gathered and moved to processor 0 into
contiguous memory locations.
Scatter operation
Here, four contiguous data values, elements of processor 0 beginning at A, are copied with
one element going to each processor at location A.
Syntax:
MPI_Scatter(send_buffer, send_count, send_type, recv_buffer, recv_count, recv_type, rank, comm)
Example:
send_count = 1;
recv_count = 1;
send_rank = 0;
MPI_Scatter(&a, send_count, MPI_REAL, &a, recv_count, MPI_REAL, send_rank, MPI_COMM_WORLD);
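Putting the two together, here is a sketch in which the root scatters one element to each process, every process transforms its element locally, and the root gathers the results back (the array name and the scaling factor are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *a = NULL;
    if (rank == 0) {
        a = malloc(size * sizeof(float));
        for (int i = 0; i < size; i++) a[i] = (float)i;   /* data to distribute */
    }

    float local;
    // Each process receives one element of a
    MPI_Scatter(a, 1, MPI_FLOAT, &local, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

    local = local * 2.0f;   /* some local computation */

    // The root collects the modified elements back, in rank order
    MPI_Gather(&local, 1, MPI_FLOAT, a, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++) printf("a[%d] = %.1f\n", i, a[i]);
        free(a);
    }
    MPI_Finalize();
    return 0;
}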
Reduction operation
• A reduction is a collective operation in which a single process (the root process) collects data
from the other processes in a group and combines them into a single data item.
• For example, you might use a reduction to compute the sum of the elements of an array
that is distributed over several processors.
• Operations other than arithmetic ones are also possible, for example, maximum and
minimum, as well as various logical and bit-wise operations.
The data, which may be array or scalar values, are initially distributed across the processors. After the
reduction operation, the reduced data (array or scalar) are located on the root processor.
Reduction operation
• The functions MPI_REDUCE and MPI_ALLREDUCE implement reduction
operations.
• They combine the values provided in the input buffer of each process, using a
specified operation op, and return the combined value either to the output buffer
of the single root process (in the case of MPI_REDUCE) or to the output buffer of
all processes (MPI_ALLREDUCE).
• The operation is applied pointwise to each of the count values provided by each
process.
• All operations return count values with the same datatype as the operands.
Reduction operation
MPI_REDUCE(inbuf, outbuf, count, type, op, root, comm)
MPI_ALLREDUCE(inbuf, outbuf, count, type, op, comm)
IN inbuf address of input buffer (choice)
OUT outbuf address of output buffer (choice)
IN count number of elements in input buffer (integer)
IN type datatype of input buffer elements (handle)
IN op reduction operation (handle)
IN root process id of root process (integer)
IN comm communicator (handle)
Reduction operation
The MPI_REDUCE routine enables you to:
• collect data from each processor,
• reduce these data to a single value (such as a sum or max), and
• store the reduced result on the root processor.
Here, the values of A on each processor are summed and the result is stored in X on the root processor.
Reduction operation
Syntax:
MPI_Reduce(send_buffer, recv_buffer, count, datatype, operation, rank, comm)
Example:
count = 1;
rank = 0;
MPI_Reduce(&a, &x, count, MPI_REAL, MPI_SUM, rank, MPI_COMM_WORLD);
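A minimal sketch of the same idea in C, in which every process contributes one integer and the root receives the sum:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank;     /* each process's contribution */
    int total = 0;        /* meaningful only on the root after the call */

    // Combine the local values with MPI_SUM and deliver the result to rank 0
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, total);

    MPI_Finalize();
    return 0;
}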
Reduction operation
Predefined operations available for MPI_REDUCE include MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, the logical and bit-wise operations MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR, and the pair operations MPI_MAXLOC and MPI_MINLOC.
Reduction operation
The operation MPI_MAXLOC combines pairs of values (v_i, l_i) and returns the pair (v, l) such that v is the maximum among all v_i and l is the smallest among all l_i for which v_i = v.
Example: One-Dimensional Matrix-Vector Multiplication
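One common row-wise formulation scatters blocks of matrix rows among the processes, broadcasts the vector to all of them, and gathers the pieces of the result at the root. A sketch follows, assuming the hypothetical dimension N is divisible by the number of processes:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 8   /* hypothetical matrix dimension; assumed divisible by #processes */

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                    /* rows owned by each process */
    double *A = NULL, x[N], y[N];
    double *localA = malloc(rows * N * sizeof(double));
    double *localy = malloc(rows * sizeof(double));

    if (rank == 0) {
        // The root initializes the full matrix and the vector
        A = malloc(N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) A[i] = 1.0;
        for (int j = 0; j < N; j++) x[j] = 1.0;
    }

    // Distribute blocks of rows and replicate the vector on every process
    MPI_Scatter(A, rows * N, MPI_DOUBLE, localA, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(x, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // Local matrix-vector product on the owned rows
    for (int i = 0; i < rows; i++) {
        localy[i] = 0.0;
        for (int j = 0; j < N; j++)
            localy[i] += localA[i * N + j] * x[j];
    }

    // Collect the pieces of the result vector at the root
    MPI_Gather(localy, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("y[0] = %.1f (expected %d)\n", y[0], N);

    free(localA); free(localy);
    if (rank == 0) free(A);
    MPI_Finalize();
    return 0;
}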
All-to-All communication
• Each process sends a different portion of the sendbuf array to each other process,
including itself.
• Each process sends to process i sendcount contiguous elements of type
senddatatype starting from the i * sendcount location of its sendbuf array.
• The data that are received are stored in the recvbuf array.
• Each process receives from process i recvcount elements of type recvdatatype
and stores them in its recvbuf array starting at location i * recvcount.
• MPI_Alltoall must be called by all the processes with the same values for the
sendcount, senddatatype, recvcount, recvdatatype, and comm arguments. Note
that sendcount and recvcount are the number of elements sent to, and received
from, each individual process.
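A minimal sketch of MPI_Alltoall with sendcount = recvcount = 1, in which each process prepares one integer for every process (including itself); the particular values are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));

    // Element i of sendbuf is destined for process i
    for (int i = 0; i < size; i++)
        sendbuf[i] = rank * 100 + i;

    // After the call, recvbuf[i] holds the element that process i sent to us
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received from rank 0: %d\n", rank, recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}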
Groups and Communicators
• In many parallel algorithms, communication operations need to be restricted
to certain subsets of processes.
• MPI provides several mechanisms for partitioning the group of processes that
belong to a communicator into subgroups each corresponding to a different
communicator.
• A general method for partitioning a group of processes is to use MPI_Comm_split, which is defined as follows:
MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
• This function is a collective operation, and thus needs to be called by all the processes in
the communicator comm.
• The function takes color and key as input parameters in addition to the communicator,
and partitions the group of processes in the communicator comm into disjoint subgroups.
• Each subgroup contains all processes that have supplied the same value for the color
parameter.
• Within each subgroup, the processes are ranked in the order defined by the value of the
key parameter, with ties broken according to their rank in the old communicator
(i.e., comm ).
• A new communicator for each subgroup is returned in the newcomm parameter.
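A sketch of MPI_Comm_split that splits MPI_COMM_WORLD into two subgroups by the parity of the rank (the color and key choices are illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int color = world_rank % 2;   /* 0 = even ranks, 1 = odd ranks */
    MPI_Comm newcomm;
    // key = world_rank keeps the original relative ordering inside each subgroup
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &newcomm);

    int new_rank, new_size;
    MPI_Comm_rank(newcomm, &new_rank);
    MPI_Comm_size(newcomm, &new_size);
    printf("world rank %d -> color %d, new rank %d of %d\n",
           world_rank, color, new_rank, new_size);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}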
One-to-All Broadcast and All-to-One Reduction
• Parallel algorithms often require a single process to send identical data to all other processes
or to a subset of them.
– This operation is known as one-to-all broadcast.
• Initially, only the source process has the data of size m that needs to be broadcast.
• At the termination of the procedure, there are p copies of the initial data – one belonging to
each process.
• The dual of one-to-all broadcast is all-to-one reduction.
• In an all-to-one reduction operation, each of the p participating processes starts with a buffer
M containing m words.
• The data from all processes are combined through an associative operator and accumulated
at a single destination process into one buffer of size m.
THANK YOU