ch3 Parallel PDF
Cache coherence
x = 2; /* shared variable */
y0 eventually ends up = 2
y1 eventually ends up = 6
z1 = ???
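The per-core statements that produce these values are not reproduced in this extract; a minimal pthreads sketch of the usual scenario is given below. It assumes (as in the standard textbook example) that core 0 reads x and then sets x = 7, while core 1 reads x twice; the slide's values hold only for the interleaving noted in the comments.

#include <pthread.h>
#include <stdio.h>

int x = 2;                     /* shared variable */
int y0, y1, z1;

void *core0_work(void *arg) {  /* statements executed by core 0 */
    (void)arg;
    y0 = x;                    /* reads x while it is still 2 */
    x = 7;                     /* then updates its cached copy of x */
    return NULL;
}

void *core1_work(void *arg) {  /* statements executed by core 1 */
    (void)arg;
    y1 = 3 * x;                /* the slide assumes this still sees x = 2, so y1 = 6 */
    z1 = 4 * x;                /* 28 if core 1 sees the updated x, 8 if it uses a stale copy */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, core0_work, NULL);
    pthread_create(&t1, NULL, core1_work, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("y0=%d y1=%d z1=%d\n", y0, y1, z1);
    return 0;
}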
Snooping Cache Coherence
The cores share a bus.
Any signal transmitted on the bus can be “seen” by all
cores connected to the bus.
When core 0 updates the copy of x stored in its cache, it also broadcasts this information across the bus.
If core 1 is “snooping” the bus, it will see that x has
been updated and it can mark its copy of x as invalid.
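A toy software model of this idea is sketched below. It is purely illustrative: real snooping is carried out in hardware by the cache controllers, and the write-through policy here is an assumption made to keep the sketch short.

#include <stdio.h>
#include <stdbool.h>

#define NCORES 2

int memory_x = 2;                         /* value of x in main memory */

struct cache_line { int value; bool valid; };
struct cache_line cache[NCORES];          /* each core's private copy of x */

/* Every write is broadcast on the "bus"; the other cores snoop it
   and invalidate their own copies. */
void write_x(int core, int value) {
    cache[core].value = value;
    cache[core].valid = true;
    memory_x = value;                     /* write-through, for simplicity */
    for (int c = 0; c < NCORES; c++)      /* broadcast on the bus */
        if (c != core)
            cache[c].valid = false;       /* snooping core marks its copy invalid */
}

int read_x(int core) {
    if (!cache[core].valid) {             /* miss: refill from memory */
        cache[core].value = memory_x;
        cache[core].valid = true;
    }
    return cache[core].value;
}

int main(void) {
    read_x(0); read_x(1);                 /* both cores cache x = 2 */
    write_x(0, 7);                        /* core 0 updates x; the bus invalidates core 1 */
    printf("core 1 now reads x = %d\n", read_x(1));   /* 7, not the stale 2 */
    return 0;
}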
Designing and developing parallel programs has
characteristically been a very manual process. The programmer
is typically responsible for both identifying and actually
implementing parallelism.
Very often, manually developing parallel codes is a time-consuming, complex, error-prone, and iterative process.
For a number of years now, various tools have been available to
assist the programmer with converting serial programs into
parallel programs. The most common type of tool used to
automatically parallelize a serial program is a parallelizing
compiler or pre-processor.
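For example, with a directive-based approach such as OpenMP, the programmer marks a loop and the compiler or pre-processor generates the parallel code. The loop and array names in this minimal C sketch are made up for illustration; compile with OpenMP support (e.g. -fopenmp) for the directive to take effect.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = i;

    /* The directive asks the compiler to split the loop iterations
       across threads; without OpenMP support it is simply ignored. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}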
Visibility of communications
– With the Message Passing Model, communications are
explicit and generally quite visible and under the control of
the programmer.
– With the Data Parallel Model, communications often occur
transparently to the programmer, particularly on distributed
memory architectures. The programmer may not even know exactly how inter-task communications are being accomplished.
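Under the message passing model, for instance, every data movement is spelled out in the source. A minimal MPI sketch (the ranks, tag, and message value are chosen here only for illustration):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 42;
        /* the send is explicit and visible in the source... */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ...and so is the matching receive */
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}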
Scope of communications
– Knowing which tasks must communicate with each other is
critical during the design stage of a parallel code. The two scopes described below can be implemented synchronously or asynchronously.
– Point-to-point - involves two tasks with one task acting as
the sender/producer of data, and the other acting as the
receiver/consumer.
– Collective - involves data sharing between more than two
tasks, which are often specified as being members in a
common group, or collective.
Examples
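As an illustration, both scopes expressed with MPI (a sketch only; the values being communicated are arbitrary). The point-to-point case pairs one sender with one receiver, while the collective calls involve every rank in the communicator.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Point-to-point: exactly one sender and one receiver (assumes >= 2 ranks) */
    if (rank == 0) {
        value = 7;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Collective: every rank in the group takes part */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);                 /* one-to-all */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);  /* all-to-one */

    if (rank == 0) printf("broadcast value = %d, sum of ranks = %d\n", value, sum);
    MPI_Finalize();
    return 0;
}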
Efficiency of communications
– Very often, the programmer will have a choice with regard to
factors that can affect communications performance. Only a
few are mentioned here.
– Which implementation for a given model should be used?
Using the Message Passing Model as an example, one MPI
implementation may be faster on a given hardware platform
than another.
– What type of communication operations should be used? As
mentioned previously, asynchronous communication
operations can improve overall program performance.
– Network media - some platforms may offer more than one
network for communications. Which one is best?
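For example, replacing blocking calls with their non-blocking (asynchronous) MPI counterparts lets computation proceed while the messages are in flight. This sketch assumes exactly two ranks; how much overlap is actually obtained depends on the MPI implementation and the network.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, send = 0, recv = 0;
    double local = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    send = rank;
    int peer = 1 - rank;                        /* assumes exactly two ranks */

    /* start both transfers without blocking */
    MPI_Isend(&send, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* overlap: do useful local work while the messages are in flight */
    for (int i = 0; i < 1000000; i++)
        local += i * 1e-6;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* then complete both operations */
    printf("rank %d got %d (local=%f)\n", rank, recv, local);
    MPI_Finalize();
    return 0;
}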
Barrier
– Usually implies that all tasks are involved
– Each task performs its work until it reaches the barrier. It then stops, or "blocks".
– When the last task reaches the barrier, all tasks are synchronized.
– What happens from here varies. Often, a serial section of work must be done. In other
cases, the tasks are automatically released to continue their work.
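A minimal sketch with MPI; the print statement before the barrier stands in for whatever work each task performs.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("rank %d working...\n", rank);       /* each task does its own work */

    MPI_Barrier(MPI_COMM_WORLD);                /* block until every task arrives */

    if (rank == 0)
        printf("all tasks reached the barrier\n");   /* e.g. a serial section */

    MPI_Finalize();
    return 0;
}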
Lock / semaphore
– Can involve any number of tasks
– Typically used to serialize (protect) access to global data or a section of code. Only
one task at a time may use (own) the lock / semaphore / flag.
– The first task to acquire the lock "sets" it. This task can then safely (serially) access the
protected data or code.
– Other tasks can attempt to acquire the lock but must wait until the task that owns the
lock releases it.
– Can be blocking or non-blocking
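A sketch using a POSIX mutex to protect a global counter; the counter is just a stand-in for any shared data or critical section of code.

#include <pthread.h>
#include <stdio.h>

long counter = 0;                          /* shared (global) data */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);         /* the first task to get here owns the lock */
        counter++;                         /* serialized (protected) access */
        pthread_mutex_unlock(&lock);       /* release so another task may acquire it */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);    /* always 400000 with the lock in place */
    return 0;
}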
Synchronous communication operations
– Involves only those tasks executing a communication operation
– When a task performs a communication operation, some form of coordination is
required with the other task(s) participating in the communication. For example, before
a task can perform a send operation, it must first receive an acknowledgment from the
receiving task that it is OK to send.
– Discussed previously in the Communications section.
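In MPI this corresponds to a synchronous send, which does not complete until the receiver has begun the matching receive: the "OK to send" handshake described above. A sketch, assuming two ranks:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 99;
        /* MPI_Ssend blocks until rank 1 has started to receive */
        MPI_Ssend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}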
Data dependence example: the value of A(J-1) must be computed before A(J), so the iterations of this loop cannot simply be distributed across tasks.
DO 500 J = MYSTART,MYEND
   A(J) = A(J-1) * 2.0
500 CONTINUE
A second dependence example: the final value of Y depends on when X is communicated between tasks (distributed memory) or on which task stores X last (shared memory).
task 1        task 2
------        ------
X = 2         X = 4
  .             .
  .             .
Y = X**2      Y = X**3
Array Processing
PI Calculation
Simple Heat Equation
1-D Wave Equation
Serial pseudocode for the PI Calculation example (Monte Carlo method):

npoints = 10000
circle_count = 0

do j = 1,npoints
  generate 2 random numbers between 0 and 1
  xcoordinate = random1
  ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle
  then circle_count = circle_count + 1
end do

PI = 4.0*circle_count/npoints
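A runnable serial version of this pseudocode in C might look like the following; rand()/RAND_MAX is used here for the two random numbers, but any generator would do.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int npoints = 10000;
    int circle_count = 0;

    for (int j = 0; j < npoints; j++) {
        /* generate 2 random numbers between 0 and 1 */
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        /* inside the (quarter) unit circle? */
        if (x * x + y * y <= 1.0)
            circle_count++;
    }

    printf("PI is approximately %f\n", 4.0 * circle_count / npoints);
    return 0;
}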