
Designing Parallel Programs

Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 2 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 3 Introduction to High Performance Computing


Cache coherence

 Programmers have no direct control over caches and when they get updated.
 However, they can organize their computation to access memory in a different order.

Figure 2.17: A shared memory system with two cores and two caches (figure not reproduced here)
Page 4 Introduction to High Performance Computing
Cache coherence

y0 privately owned by Core 0
y1 and z1 privately owned by Core 1

x = 2; /* shared variable */

y0 eventually ends up = 2
y1 eventually ends up = 6
z1 = ???
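
This slide appears to reference the example in Figure 2.17 of Pacheco's text, where core 0 executes y0 = x; x = 7; and core 1 executes y1 = 3*x; z1 = 4*x;. A minimal C sketch of that scenario is given below. It is a hedged reconstruction, not the original slide's code, and it contains a deliberate data race purely to illustrate why z1 is unpredictable (compile with cc -pthread).

/* Hedged reconstruction of the two-core coherence example: core 0 updates the
   shared x after core 1 may already have cached the old value.  The data race
   is intentional; it exists only to illustrate the visibility problem. */
#include <pthread.h>
#include <stdio.h>

int x = 2;                     /* shared variable                          */
int y0_val, y1_val, z1_val;    /* y0 private to core 0; y1, z1 to core 1   */

void *core0(void *arg) {
    y0_val = x;                /* reads 2                                  */
    x = 7;                     /* update may sit in core 0's cache a while */
    return NULL;
}

void *core1(void *arg) {
    y1_val = 3 * x;            /* 6 if the stale value 2 is still cached   */
    z1_val = 4 * x;            /* 8 or 28, depending on when core 0's      */
    return NULL;               /* update becomes visible to this core      */
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, core0, NULL);
    pthread_create(&t1, NULL, core1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("y0 = %d  y1 = %d  z1 = %d\n", y0_val, y1_val, z1_val);
    return 0;
}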

Page 5 Introduction to High Performance Computing
Snooping Cache Coherence
 The cores share a bus.
 Any signal transmitted on the bus can be “seen” by all
cores connected to the bus.
 When core 0 updates the copy of x stored in its cache
it also broadcasts this information across the bus.
 If core 1 is “snooping” the bus, it will see that x has
been updated and it can mark its copy of x as invalid.

Page 6 Introduction to High Performance Computing
 Designing and developing parallel programs has
characteristically been a very manual process. The programmer
is typically responsible for both identifying and actually
implementing parallelism.
 Very often, manually developing parallel codes is a time
consuming, complex, error-prone and iterative process.
 For a number of years now, various tools have been available to
assist the programmer with converting serial programs into
parallel programs. The most common type of tool used to
automatically parallelize a serial program is a parallelizing
compiler or pre-processor.

Page 7 Introduction to High Performance Computing


 A parallelizing compiler generally works in two different ways:
– Fully Automatic
 The compiler analyzes the source code and identifies opportunities for
parallelism.
 The analysis includes a cost weighting on whether or not the
parallelism would actually improve performance.
 Loops (do, for) are the most frequent target for automatic
parallelization.
– Programmer Directed
 Using "compiler directives" or possibly compiler flags, the programmer
explicitly tells the compiler how to parallelize the code.
 May be able to be used in conjunction with some degree of automatic
parallelization also.
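
As a minimal illustration of the programmer-directed approach (a hedged C sketch, not taken from the slides), a single OpenMP directive tells the compiler to distribute the loop iterations across threads; compile with an OpenMP flag such as gcc -fopenmp.

/* Programmer-directed parallelization: the directive below instructs an
   OpenMP-capable compiler to split the loop iterations across threads. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}

Without the directive (or when compiled without the OpenMP flag), the same code simply runs serially, which is one reason directives are a low-risk way to introduce parallelism.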

Page 8 Introduction to High Performance Computing


 If you are beginning with an existing serial code and have time
or budget constraints, then automatic parallelization may be the
answer. However, there are several important limitations that
apply to automatic parallelization:
– Wrong results may be produced
– Performance may actually degrade
– Much less flexible than manual parallelization
– Limited to a subset (mostly loops) of code
– Most automatic parallelization tools are for Fortran
 The remainder of this section applies to the manual method of
developing parallel codes.

Page 9 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 10 Introduction to High Performance Computing


 Undoubtedly, the first step in developing parallel
software is to first understand the problem that you
wish to solve in parallel. If you are starting with a
serial program, this necessitates understanding the
existing code also.
 Before spending time in an attempt to develop a
parallel solution for a problem, determine whether or
not the problem is one that can actually be
parallelized.

Page 11 Introduction to High Performance Computing


Example of Parallelizable Problem

Calculate the potential energy for each of several


thousand independent conformations of a molecule.
When done, find the minimum energy conformation.

 This problem is able to be solved in parallel. Each of


the molecular conformations is independently
determinable. The calculation of the minimum energy
conformation is also a parallelizable problem.
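
A minimal C/OpenMP sketch of this pattern follows; energy() is a hypothetical stand-in for the real potential-energy calculation, and the minimum-energy search is expressed as a reduction (the min reduction requires OpenMP 3.1 or later).

/* Each conformation's energy is computed independently; a min-reduction
   then finds the lowest value.  energy() is a placeholder, not real physics. */
#include <float.h>
#include <stdio.h>

#define NCONF 5000

static double energy(int conf) {
    return (conf % 97) * 0.5 + (conf % 13);   /* hypothetical stand-in */
}

int main(void) {
    double emin = DBL_MAX;

    #pragma omp parallel for reduction(min:emin)
    for (int c = 0; c < NCONF; c++) {
        double e = energy(c);
        if (e < emin) emin = e;
    }

    printf("minimum energy = %f\n", emin);
    return 0;
}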

Page 12 Introduction to High Performance Computing


Example of a Non-parallelizable Problem

Calculation of the Fibonacci series


(1,1,2,3,5,8,13,21,...) by use of the formula:
F(k + 2) = F(k + 1) + F(k)

 This is a non-parallelizable problem because the


calculation of the Fibonacci sequence as shown
would entail dependent calculations rather than
independent ones. The calculation of the k + 2 value
uses those of both k + 1 and k. These three terms
cannot be calculated independently and therefore,
not in parallel.

Page 13 Introduction to High Performance Computing


Identify the program's hotspots

 Know where most of the real work is being done. The


majority of scientific and technical programs usually
accomplish most of their work in a few places.
 Profilers and performance analysis tools can help
here
 Focus on parallelizing the hotspots and ignore those
sections of the program that account for little CPU
usage.

Page 14 Introduction to High Performance Computing


Identify bottlenecks in the program

 Are there areas that are disproportionately slow, or


cause parallelizable work to halt or be deferred? For
example, I/O is usually something that slows a
program down.
 May be possible to restructure the program or use a
different algorithm to reduce or eliminate
unnecessary slow areas

Page 15 Introduction to High Performance Computing


Other considerations

 Identify inhibitors to parallelism. One common class


of inhibitor is data dependence, as demonstrated by
the Fibonacci sequence above.
 Investigate other algorithms if possible. This may be
the single most important consideration when
designing a parallel application.

Page 16 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 17 Introduction to High Performance Computing


 One of the first steps in designing a parallel program
is to break the problem into discrete "chunks" of work
that can be distributed to multiple tasks. This is
known as decomposition or partitioning.
 There are two basic ways to partition computational
work among parallel tasks:
– domain decomposition
and
– functional decomposition

Page 18 Introduction to High Performance Computing


Domain Decomposition

 In this type of partitioning, the data associated with a


problem is decomposed. Each parallel task then
works on a portion of the data.

Page 19 Introduction to High Performance Computing


Partitioning Data

 There are different ways to partition data

Page 20 Introduction to High Performance Computing


Functional Decomposition
 In this approach, the focus is on the computation that is to be
performed rather than on the data manipulated by the
computation. The problem is decomposed according to the work
that must be done. Each task then performs a portion of the
overall work.
 Functional decomposition lends itself well to problems that can
be split into different tasks. For example
– Ecosystem Modeling
– Signal Processing
– Climate Modeling

Page 21 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 22 Introduction to High Performance Computing


Who Needs Communications?

 The need for communications between tasks depends upon your


problem
 You DON'T need communications
– Some types of problems can be decomposed and executed in parallel with
virtually no need for tasks to share data. For example, imagine an image
processing operation where every pixel in a black and white image needs to
have its color reversed. The image data can easily be distributed to multiple
tasks that then act independently of each other to do their portion of the
work.
– These types of problems are often called embarrassingly parallel because
they are so straight-forward. Very little inter-task communication is required.
 You DO need communications
– Most parallel applications are not quite so simple, and do require tasks to
share data with each other.

Page 23 Introduction to High Performance Computing


Factors to Consider (1)

 There are a number of important factors to consider


when designing your program's inter-task
communications
 Cost of communications
– Inter-task communication virtually always implies overhead.
– Machine cycles and resources that could be used for
computation are instead used to package and transmit data.
– Communications frequently require some type of
synchronization between tasks, which can result in tasks
spending time "waiting" instead of doing work.
– Competing communication traffic can saturate the available
network bandwidth, further aggravating performance
problems.

Page 24 Introduction to High Performance Computing


Factors to Consider (2)

 Latency vs. Bandwidth


– latency is the time it takes to send a minimal (0 byte)
message from point A to point B. Commonly expressed as
microseconds.
– bandwidth is the amount of data that can be communicated
per unit of time. Commonly expressed as megabytes/sec.
– Sending many small messages can cause latency to
dominate communication overheads. Often it is more
efficient to package small messages into a larger message,
thus increasing the effective communications bandwidth.

Page 25 Introduction to High Performance Computing


Factors to Consider (3)

 Visibility of communications
– With the Message Passing Model, communications are
explicit and generally quite visible and under the control of
the programmer.
– With the Data Parallel Model, communications often occur
transparently to the programmer, particularly on distributed
memory architectures. The programmer may not even be
able to know exactly how inter-task communications are
being accomplished.

Page 26 Introduction to High Performance Computing


Factors to Consider (4)

 Synchronous vs. asynchronous communications


– Synchronous communications require some type of "handshaking"
between tasks that are sharing data. This can be explicitly
structured in code by the programmer, or it may happen at a lower
level unknown to the programmer.
– Synchronous communications are often referred to as blocking
communications since other work must wait until the
communications have completed.
– Asynchronous communications allow tasks to transfer data
independently from one another. For example, task 1 can prepare
and send a message to task 2, and then immediately begin doing
other work. When task 2 actually receives the data doesn't matter.
– Asynchronous communications are often referred to as non-
blocking communications since other work can be done while the
communications are taking place.
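
A minimal MPI sketch of non-blocking communication follows (a hedged example, not from the slides; run with at least two tasks). Task 0 posts the send, continues with other work, and only waits when it needs the transfer to be complete.

/* Asynchronous (non-blocking) communication: the sender and receiver post
   their operations, do other work, and complete them later with MPI_Wait. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data = 42;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... other useful work happens here while the message is in flight ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete before reusing 'data' */
    } else if (rank == 1) {
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* ... other work ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("task 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}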

Page 27 Introduction to High Performance Computing


Factors to Consider (5)

 Scope of communications
– Knowing which tasks must communicate with each other is
critical during the design stage of a parallel code. Both of the
two scopings described below can be implemented
synchronously or asynchronously.
– Point-to-point - involves two tasks with one task acting as
the sender/producer of data, and the other acting as the
receiver/consumer.
– Collective - involves data sharing between more than two
tasks, which are often specified as being members in a
common group, or collective.

Page 28 Introduction to High Performance Computing


Collective Communications

 Examples
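
The original slide illustrates collective operations (typically broadcast, scatter, gather and reduction) with diagrams that are not reproduced here. As a stand-in, here is a minimal MPI sketch of two of them, a broadcast and a reduction:

/* Two common collectives: every task in the group participates in each call. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs, value = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) value = 100;

    /* Broadcast: the root sends the same value to every task. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduction: one contribution per task is combined into a result on the root. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast value = %d, sum of ranks = %d (for %d tasks)\n",
               value, sum, nprocs);

    MPI_Finalize();
    return 0;
}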

Page 29 Introduction to High Performance Computing


Factors to Consider (6)

 Efficiency of communications
– Very often, the programmer will have a choice with regard to
factors that can affect communications performance. Only a
few are mentioned here.
– Which implementation for a given model should be used?
Using the Message Passing Model as an example, one MPI
implementation may be faster on a given hardware platform
than another.
– What type of communication operations should be used? As
mentioned previously, asynchronous communication
operations can improve overall program performance.
– Network media - some platforms may offer more than one
network for communications. Which one is best?

Page 30 Introduction to High Performance Computing


Factors to Consider (7)

 Overhead and Complexity

Page 31 Introduction to High Performance Computing


Factors to Consider (8)

 Finally, realize that this is only a partial list of things to consider!!!

Page 32 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 33 Introduction to High Performance Computing


Types of Synchronization

 Barrier
– Usually implies that all tasks are involved
– Each task performs its work until it reaches the barrier. It then stops, or "blocks".
– When the last task reaches the barrier, all tasks are synchronized.
– What happens from here varies. Often, a serial section of work must be done. In other
cases, the tasks are automatically released to continue their work.
 Lock / semaphore
– Can involve any number of tasks
– Typically used to serialize (protect) access to global data or a section of code. Only
one task at a time may use (own) the lock / semaphore / flag.
– The first task to acquire the lock "sets" it. This task can then safely (serially) access the
protected data or code.
– Other tasks can attempt to acquire the lock but must wait until the task that owns the
lock releases it.
– Can be blocking or non-blocking
 Synchronous communication operations
– Involves only those tasks executing a communication operation
– When a task performs a communication operation, some form of coordination is
required with the other task(s) participating in the communication. For example, before
a task can perform a send operation, it must first receive an acknowledgment from the
receiving task that it is OK to send.
– Discussed previously in the Communications section.
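
A minimal C/OpenMP sketch of two of these mechanisms, a lock-like critical section and a barrier, is shown below (a hedged illustration, not taken from the slides):

/* Synchronization sketch: a critical section serializes updates to a shared
   counter; the barrier holds every thread until all of them have arrived. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    int counter = 0;

    #pragma omp parallel
    {
        #pragma omp critical      /* lock / semaphore style: one thread at a time */
        counter++;

        #pragma omp barrier       /* no thread continues until all reach here     */

        #pragma omp single
        printf("all %d threads checked in, counter = %d\n",
               omp_get_num_threads(), counter);
    }
    return 0;
}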

Page 34 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 35 Introduction to High Performance Computing


Definitions

 A dependence exists between program statements


when the order of statement execution affects the
results of the program.
 A data dependence results from multiple use of the
same location(s) in storage by different tasks.
 Dependencies are important to parallel programming
because they are one of the primary inhibitors to
parallelism.

Page 36 Introduction to High Performance Computing


Examples (1): Loop carried data dependence

DO 500 J = MYSTART,MYEND
   A(J) = A(J-1) * 2.0
500 CONTINUE

 The value of A(J-1) must be computed before the


value of A(J), therefore A(J) exhibits a data
dependency on A(J-1). Parallelism is inhibited.
 If Task 2 has A(J) and task 1 has A(J-1), computing
the correct value of A(J) necessitates:
– Distributed memory architecture - task 2 must obtain the
value of A(J-1) from task 1 after task 1 finishes its
computation
– Shared memory architecture - task 2 must read A(J-1) after
task 1 updates it

Page 37 Introduction to High Performance Computing


Examples (2): Loop independent data dependence

task 1 task 2
------ ------
X = 2 X = 4
. .
. .
Y = X**2 Y = X**3

 As with the previous example, parallelism is inhibited. The value


of Y is dependent on:
– Distributed memory architecture - if or when the value of X is
communicated between the tasks.
– Shared memory architecture - which task last stores the value of X.
 Although all data dependencies are important to identify when
designing parallel programs, loop carried dependencies are
particularly important since loops are possibly the most common
target of parallelization efforts.

Page 38 Introduction to High Performance Computing


How to Handle Data Dependencies?

 Distributed memory architectures - communicate


required data at synchronization points.
 Shared memory architectures - synchronize read/write
operations between tasks.

Page 39 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 40 Introduction to High Performance Computing


Definition

 Load balancing refers to the practice of distributing work among


tasks so that all tasks are kept busy all of the time. It can be
considered a minimization of task idle time.
 Load balancing is important to parallel programs for
performance reasons. For example, if all tasks are subject to a
barrier synchronization point, the slowest task will determine the
overall performance.

Page 41 Introduction to High Performance Computing


How to Achieve Load Balance? (1)

 Equally partition the work each task receives


– For array/matrix operations where each task performs similar
work, evenly distribute the data set among the tasks.
– For loop iterations where the work done in each iteration is
similar, evenly distribute the iterations across the tasks.
– If a heterogeneous mix of machines with varying
performance characteristics is being used, be sure to use
some type of performance analysis tool to detect any load
imbalances. Adjust work accordingly.

Page 42 Introduction to High Performance Computing


How to Achieve Load Balance? (2)

 Use dynamic work assignment


– Certain classes of problems result in load imbalances even if data
is evenly distributed among tasks:
 Sparse arrays - some tasks will have actual data to work on while
others have mostly "zeros".
 Adaptive grid methods - some tasks may need to refine their mesh
while others don't.
 N-body simulations - where some particles may migrate to/from their
original task domain to another task's; where the particles owned by
some tasks require more work than those owned by other tasks.
– When the amount of work each task will perform is intentionally
variable, or is unable to be predicted, it may be helpful to use a
scheduler - task pool approach. As each task finishes its work, it
queues to get a new piece of work.
– It may become necessary to design an algorithm which detects and
handles load imbalances as they occur dynamically within the code.

Page 43 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 44 Introduction to High Performance Computing


Definitions

 Computation / Communication Ratio:


– In parallel computing, granularity is a qualitative measure of
the ratio of computation to communication.
– Periods of computation are typically separated from periods
of communication by synchronization events.
 Fine grain parallelism
 Coarse grain parallelism

Page 45 Introduction to High Performance Computing


Fine-grain Parallelism

 Relatively small amounts of computational work


are done between communication events
 Low computation to communication ratio
 Facilitates load balancing
 Implies high communication overhead and less
opportunity for performance enhancement
 If granularity is too fine it is possible that the
overhead required for communications and
synchronization between tasks takes longer
than the computation.

Page 46 Introduction to High Performance Computing


Coarse-grain Parallelism

 Relatively large amounts of


computational work are done between
communication/synchronization events
 High computation to communication
ratio
 Implies more opportunity for
performance increase
 Harder to load balance efficiently

Page 47 Introduction to High Performance Computing


Which is Best?

 The most efficient granularity is dependent on the


algorithm and the hardware environment in which it
runs.
 In most cases the overhead associated with
communications and synchronization is high relative
to execution speed so it is advantageous to have
coarse granularity.
 Fine-grain parallelism can help reduce overheads
due to load imbalance.

Page 48 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 49 Introduction to High Performance Computing


The Bad News

 I/O operations are generally regarded as inhibitors to


parallelism
 Parallel I/O systems are immature or not available for
all platforms
 In an environment where all tasks see the same
filespace, write operations will result in file overwriting
 Read operations will be affected by the fileserver's
ability to handle multiple read requests at the same
time
 I/O that must be conducted over the network (NFS,
non-local) can cause severe bottlenecks

Page 50 Introduction to High Performance Computing


The Good News

 Some parallel file systems are available. For example:


– GPFS: General Parallel File System for AIX (IBM)
– Lustre: for Linux clusters (Cluster File Systems, Inc.)
– PVFS/PVFS2: Parallel Virtual File System for Linux clusters
(Clemson/Argonne/Ohio State/others)
– PanFS: Panasas ActiveScale File System for Linux clusters
(Panasas, Inc.)
– HP SFS: HP StorageWorks Scalable File Share. Lustre based
parallel file system (Global File System for Linux) product from HP
 The parallel I/O programming interface specification for MPI has
been available since 1996 as part of MPI-2. Vendor and "free"
implementations are now commonly available.

Page 51 Introduction to High Performance Computing


Some Options

 If you have access to a parallel file system, investigate using it.


If you don't, keep reading...
 Rule #1: Reduce overall I/O as much as possible
 Confine I/O to specific serial portions of the job, and then use
parallel communications to distribute data to parallel tasks. For
example, Task 1 could read an input file and then communicate
required data to other tasks. Likewise, Task 1 could perform
write operation after receiving required data from all other tasks.
 For distributed memory systems with shared filespace, perform
I/O in local, non-shared filespace. For example, each processor
may have /tmp filespace which can be used. This is usually much
more efficient than performing I/O over the network to one's
home directory.
 Create unique filenames for each task's input/output file(s)

Page 52 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 53 Introduction to High Performance Computing


Amdahl's Law

Amdahl's Law states that potential


program speedup is defined by the
fraction of code (P) that can be
parallelized:
speedup = 1 / (1 - P)

 If none of the code can be parallelized, P = 0 and the


speedup = 1 (no speedup). If all of the code is
parallelized, P = 1 and the speedup is infinite (in
theory).
 If 50% of the code can be parallelized, maximum
speedup = 2, meaning the code will run twice as fast.

Page 54 Introduction to High Performance Computing


Amdahl's Law

 Introducing the number of processors performing the


parallel fraction of work, the relationship can be
modeled by
speedup = 1 / (P/N + S)

 where P = parallel fraction, N = number of processors


and S = serial fraction

Page 55 Introduction to High Performance Computing


Amdahl's Law

 It soon becomes obvious that there are limits to the


scalability of parallelism. For example, at P = .50, .90
and .99 (50%, 90% and 99% of the code is
parallelizable)
                     speedup
        --------------------------------
    N      P = .50    P = .90    P = .99
-----      -------    -------    -------
   10       1.82       5.26       9.17
  100       1.98       9.17      50.25
 1000       1.99       9.91      90.99
10000       1.99       9.91      99.02
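
The entries above follow directly from speedup = 1 / (P/N + S) with S = 1 - P; a small C sketch that reproduces the table:

/* Reproduce the Amdahl's Law speedup table: speedup = 1 / (P/N + S), S = 1 - P. */
#include <stdio.h>

int main(void) {
    const double P[] = {0.50, 0.90, 0.99};
    const int    N[] = {10, 100, 1000, 10000};

    for (int i = 0; i < 4; i++) {
        printf("N = %5d:", N[i]);
        for (int j = 0; j < 3; j++) {
            double S = 1.0 - P[j];
            printf("   P=%.2f -> %6.2f", P[j], 1.0 / (P[j] / N[i] + S));
        }
        printf("\n");
    }
    return 0;
}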

Page 56 Introduction to High Performance Computing


Complexity

 In general, parallel applications are much more complex than


corresponding serial applications, perhaps an order of
magnitude. Not only do you have multiple instruction streams
executing at the same time, but you also have data flowing
between them.
 The costs of complexity are measured in programmer time in
virtually every aspect of the software development cycle:
– Design
– Coding
– Debugging
– Tuning
– Maintenance
 Adhering to "good" software development practices is essential
when working with parallel applications - especially if
somebody besides you will have to work with the software.

Page 57 Introduction to High Performance Computing


Resource Requirements

 The primary intent of parallel programming is to decrease


execution wall clock time, however in order to accomplish this,
more CPU time is required. For example, a parallel code that
runs in 1 hour on 8 processors actually uses 8 hours of CPU
time.
 The amount of memory required can be greater for parallel
codes than serial codes, due to the need to replicate data and
for overheads associated with parallel support libraries and
subsystems.
 For short running parallel programs, there can actually be a
decrease in performance compared to a similar serial
implementation. The overhead costs associated with setting up
the parallel environment, task creation, communications and
task termination can comprise a significant portion of the total
execution time for short runs.

Page 58 Introduction to High Performance Computing


Scalability

 The ability of a parallel program's performance to scale is a


result of a number of interrelated factors. Simply adding more
machines is rarely the answer.
 The algorithm may have inherent limits to scalability. At some
point, adding more resources causes performance to decrease.
Most parallel solutions demonstrate this characteristic at some
point.
 Hardware factors play a significant role in scalability. Examples:
– Memory-CPU bus bandwidth on an SMP machine
– Communications network bandwidth
– Amount of memory available on any given machine or set of
machines
– Processor clock speed
 Parallel support libraries and subsystems software can limit
scalability independent of your application.

Page 59 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 60 Introduction to High Performance Computing


 As with debugging, monitoring and analyzing parallel
program execution is significantly more of a
challenge than for serial programs.
 A number of parallel tools for execution monitoring
and program analysis are available.
 Some are quite useful; some are cross-platform also.
 One starting point: Performance Analysis Tools
Tutorial
 Work remains to be done, particularly in the area of
scalability.

Page 61 Introduction to High Performance Computing


Parallel Examples
Agenda

 Array Processing
 PI Calculation
 Simple Heat Equation
 1-D Wave Equation

Page 63 Introduction to High Performance Computing


Array Processing

 This example demonstrates calculations on 2-dimensional array


elements, with the computation on each array element being
independent from other array elements.
 The serial program calculates one element at a time in
sequential order.
 Serial code could be of the form:
do j = 1,n
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
 The calculation of elements is independent of one another -
leads to an embarrassingly parallel situation.
 The problem should be computationally intensive.

Page 64 Introduction to High Performance Computing


Array Processing Solution 1

 Array elements are distributed so that each processor owns a portion of an


array (subarray).
 Independent calculation of array elements ensures there is no need for
communication between tasks.
 Distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1)
through the subarrays. Unit stride maximizes cache/memory usage.
 Since it is desirable to have unit stride through the subarrays, the choice of a
distribution scheme depends on the programming language. See the Block -
Cyclic Distributions Diagram for the options.
 After the array is distributed, each task executes the portion of the loop
corresponding to the data it owns. For example, with Fortran block distribution:
do j = mystart, myend
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
 Notice that only the outer loop variables are different from the serial solution.

Page 65 Introduction to High Performance Computing


Array Processing Solution 1
One possible implementation

 Implement as SPMD model.


 Master process initializes array, sends info to worker
processes and receives results.
 Worker process receives info, performs its share of
computation and sends results to master.
 Using the Fortran storage scheme, perform block
distribution of the array.
 Pseudo code solution: in the original slides, red highlights the
changes for parallelism. A sketch is given below.

Page 66 Introduction to High Performance Computing


Array Processing Solution 1
One possible implementation
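
The original slide presents the master/worker pseudo code as an image, which is not reproduced here. In its place, a minimal MPI sketch of the same SPMD block-distribution idea follows. Hedged assumptions: N is divisible by the number of tasks, fcn() is a hypothetical placeholder, and MPI_Gather stands in for the explicit sends/receives of the original pseudo code.

/* SPMD block distribution: each task computes its own block of columns of
   a(n,n); the master (task 0) gathers the blocks into the full array. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024                                  /* assume N % ntasks == 0 */

static double fcn(int i, int j) { return (double)i * j; }   /* placeholder */

int main(int argc, char **argv) {
    int rank, ntasks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    int ncols = N / ntasks;                     /* this task's block of columns */
    double *block = malloc((size_t)ncols * N * sizeof *block);

    for (int j = 0; j < ncols; j++)             /* compute only the owned block */
        for (int i = 0; i < N; i++)
            block[j * N + i] = fcn(i, rank * ncols + j);

    double *a = NULL;
    if (rank == 0) a = malloc((size_t)N * N * sizeof *a);

    /* Master collects every task's block into the full array. */
    MPI_Gather(block, ncols * N, MPI_DOUBLE,
               a,     ncols * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("a(n,n) = %f\n", a[N * N - 1]);

    free(block);
    free(a);
    MPI_Finalize();
    return 0;
}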

Page 67 Introduction to High Performance Computing


Array Processing Solution 2: Pool of Tasks

 The previous array solution demonstrated static load


balancing:
– Each task has a fixed amount of work to do
– May be significant idle time for faster or more lightly loaded
processors - slowest tasks determines overall performance.
 Static load balancing is not usually a major concern if
all tasks are performing the same amount of work on
identical machines.
 If you have a load balance problem (some tasks work
faster than others), you may benefit by using a "pool
of tasks" scheme.

Page 68 Introduction to High Performance Computing


Array Processing Solution 2
Pool of Tasks Scheme

 Two processes are employed


 Master Process:
– Holds pool of tasks for worker processes to do
– Sends worker a task when requested
– Collects results from workers
 Worker Process: repeatedly does the following
– Gets task from master process
– Performs computation
– Sends results to master
 Worker processes do not know before runtime which portion of
array they will handle or how many tasks they will perform.
 Dynamic load balancing occurs at run time: the faster tasks will
get more work to do.
 Pseudo code solution: in the original slides, red highlights the changes for parallelism. A sketch is given below.

Page 69 Introduction to High Performance Computing


Array Processing Solution 2 Pool of Tasks Scheme
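
The original slide again shows the pseudo code as an image. A minimal MPI sketch of the task-pool (master/worker) scheme follows; do_work() is a hypothetical placeholder, and for brevity the master discards results rather than storing them as a real code would.

/* Task pool: the master hands out one work item at a time; each worker sends
   back a result and asks for more until the pool is empty. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

static double do_work(int item) { return item * 2.0; }      /* placeholder */

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                             /* master: owns the pool        */
        int next = 0, active = nprocs - 1;
        double result;
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                                     /* worker: request, compute, return */
        double result = 0.0;                     /* first send is a dummy "ready"    */
        int item;
        for (;;) {
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&item, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = do_work(item);
        }
    }

    MPI_Finalize();
    return 0;
}

Faster workers loop more often and therefore receive more items, which is exactly the dynamic load balancing described above.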

Page 70 Introduction to High Performance Computing


Pi Calculation

 The value of PI can be calculated in a number of


ways. Consider the following method of
approximating PI
– Inscribe a circle in a square
– Randomly generate points in the square
– Determine the number of points in the square that are also in
the circle
– Let r be the number of points in the circle divided by the
number of points in the square
– PI ~ 4 r
 Note that the more points generated, the better the
approximation

Page 71 Introduction to High Performance Computing


Discussion

 In the above pool of tasks example, each task


calculated an individual array element as a job. The
computation to communication ratio is finely granular.
 Finely granular solutions incur more communication
overhead in order to reduce task idle time.
 A more optimal solution might be to distribute more
work with each job. The "right" amount of work is
problem dependent.

Page 72 Introduction to High Performance Computing


Algorithm

npoints = 10000
circle_count = 0
do j = 1,npoints
  generate 2 random numbers between 0 and 1
  xcoordinate = random1
  ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle then
    circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints

 Note that most of the time in running this program


would be spent executing the loop
 Leads to an embarrassingly parallel solution
- Computationally intensive
- Minimal communication
- Minimal I/O
Page 73 Introduction to High Performance Computing
PI Calculation
Parallel Solution

 Parallel strategy: break the loop into


portions that can be executed by the
tasks.
 For the task of approximating PI:
– Each task executes its portion of the
loop a number of times.
– Each task can do its work without
requiring any information from the
other tasks (there are no data
dependencies).
– Uses the SPMD model. One task
acts as master and collects the
results.
 Pseudo code solution: in the original slides, red highlights
the changes for parallelism. A sketch is given below.

Page 74 Introduction to High Performance Computing


PI Calculation
Parallel Solution
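
The original slide shows the parallel pseudo code as an image (not reproduced here). A minimal MPI sketch of the SPMD strategy just described follows: every task samples its portion of the points, and a reduction collects the hit counts on the master.

/* Monte Carlo PI, SPMD style: each task runs its share of the samples;
   MPI_Reduce sums the circle counts on task 0. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NPOINTS 1000000L

int main(int argc, char **argv) {
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand((unsigned)rank + 1);                  /* different random stream per task */
    long my_points = NPOINTS / nprocs;          /* this task's portion of the loop  */
    long my_count = 0;

    for (long j = 0; j < my_points; j++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            my_count++;
    }

    long total = 0;
    MPI_Reduce(&my_count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("PI is approximately %f\n",
               4.0 * total / ((double)my_points * nprocs));

    MPI_Finalize();
    return 0;
}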

Page 75 Introduction to High Performance Computing


This ends this tutorial

Page 76 Introduction to High Performance Computing
