
Designing Parallel Programs

Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 2 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 3 Introduction to High Performance Computing


Cache coherence

 Programmers have no direct control over caches and when they get updated.
 However, they can organize their computation to access memory in a different order.

Figure 2.17: A shared memory system with two cores and two caches (figure not reproduced here)
Page 4 Introduction to High Performance Computing
Cache coherence

y0 privately owned by Core 0
y1 and z1 privately owned by Core 1

x = 2; /* shared variable */

y0 eventually ends up = 2
y1 eventually ends up = 6
z1 = ???
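
This slide appears to reference the example in Figure 2.17 of Pacheco's text, where core 0 executes y0 = x; x = 7; and core 1 executes y1 = 3*x; z1 = 4*x;. A minimal C sketch of that scenario is given below. It is a hedged reconstruction, not the original slide's code, and it contains a deliberate data race purely to illustrate why z1 is unpredictable (compile with cc -pthread).

/* Hedged reconstruction of the two-core coherence example: core 0 updates the
   shared x after core 1 may already have cached the old value.  The data race
   is intentional; it exists only to illustrate the visibility problem. */
#include <pthread.h>
#include <stdio.h>

int x = 2;                     /* shared variable                          */
int y0_val, y1_val, z1_val;    /* y0 private to core 0; y1, z1 to core 1   */

void *core0(void *arg) {
    y0_val = x;                /* reads 2                                  */
    x = 7;                     /* update may sit in core 0's cache a while */
    return NULL;
}

void *core1(void *arg) {
    y1_val = 3 * x;            /* 6 if the stale value 2 is still cached   */
    z1_val = 4 * x;            /* 8 or 28, depending on when core 0's      */
    return NULL;               /* update becomes visible to this core      */
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, core0, NULL);
    pthread_create(&t1, NULL, core1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("y0 = %d  y1 = %d  z1 = %d\n", y0_val, y1_val, z1_val);
    return 0;
}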

Page 5 Introduction to High Performance Computing
Snooping Cache Coherence
 The cores share a bus.
 Any signal transmitted on the bus can be “seen” by all
cores connected to the bus.
 When core 0 updates the copy of x stored in its cache
it also broadcasts this information across the bus.
 If core 1 is “snooping” the bus, it will see that x has
been updated and it can mark its copy of x as invalid.

Page 6 Introduction to High Performance Computing
 Designing and developing parallel programs has
characteristically been a very manual process. The programmer
is typically responsible for both identifying and actually
implementing parallelism.
 Very often, manually developing parallel codes is a time
consuming, complex, error-prone and iterative process.
 For a number of years now, various tools have been available to
assist the programmer with converting serial programs into
parallel programs. The most common type of tool used to
automatically parallelize a serial program is a parallelizing
compiler or pre-processor.

Page 7 Introduction to High Performance Computing


 A parallelizing compiler generally works in two different ways:
– Fully Automatic
 The compiler analyzes the source code and identifies opportunities for
parallelism.
 The analysis includes a cost weighting on whether or not the
parallelism would actually improve performance.
 Loops (do, for) are the most frequent target for automatic
parallelization.
– Programmer Directed
 Using "compiler directives" or possibly compiler flags, the programmer
explicitly tells the compiler how to parallelize the code.
 May be able to be used in conjunction with some degree of automatic
parallelization also.
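
As a minimal illustration of the programmer-directed approach (a hedged C sketch, not taken from the slides), a single OpenMP directive tells the compiler to distribute the loop iterations across threads; compile with an OpenMP flag such as gcc -fopenmp.

/* Programmer-directed parallelization: the directive below instructs an
   OpenMP-capable compiler to split the loop iterations across threads. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}

Without the directive (or when compiled without the OpenMP flag), the same code simply runs serially, which is one reason directives are a low-risk way to introduce parallelism.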

Page 8 Introduction to High Performance Computing


 If you are beginning with an existing serial code and have time
or budget constraints, then automatic parallelization may be the
answer. However, there are several important limitations that
apply to automatic parallelization:
– Wrong results may be produced
– Performance may actually degrade
– Much less flexible than manual parallelization
– Limited to a subset (mostly loops) of code
– Most automatic parallelization tools are for Fortran
 The remainder of this section applies to the manual method of
developing parallel codes.

Page 9 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 10 Introduction to High Performance Computing


 Undoubtedly, the first step in developing parallel
software is to first understand the problem that you
wish to solve in parallel. If you are starting with a
serial program, this necessitates understanding the
existing code also.
 Before spending time in an attempt to develop a
parallel solution for a problem, determine whether or
not the problem is one that can actually be
parallelized.

Page 11 Introduction to High Performance Computing


Example of Parallelizable Problem

Calculate the potential energy for each of several


thousand independent conformations of a molecule.
When done, find the minimum energy conformation.

 This problem is able to be solved in parallel. Each of


the molecular conformations is independently
determinable. The calculation of the minimum energy
conformation is also a parallelizable problem.
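
A minimal C/OpenMP sketch of this pattern follows; energy() is a hypothetical stand-in for the real potential-energy calculation, and the minimum-energy search is expressed as a reduction (the min reduction requires OpenMP 3.1 or later).

/* Each conformation's energy is computed independently; a min-reduction
   then finds the lowest value.  energy() is a placeholder, not real physics. */
#include <float.h>
#include <stdio.h>

#define NCONF 5000

static double energy(int conf) {
    return (conf % 97) * 0.5 + (conf % 13);   /* hypothetical stand-in */
}

int main(void) {
    double emin = DBL_MAX;

    #pragma omp parallel for reduction(min:emin)
    for (int c = 0; c < NCONF; c++) {
        double e = energy(c);
        if (e < emin) emin = e;
    }

    printf("minimum energy = %f\n", emin);
    return 0;
}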

Page 12 Introduction to High Performance Computing


Example of a Non-parallelizable Problem

Calculation of the Fibonacci series


(1,1,2,3,5,8,13,21,...) by use of the formula:
F(k + 2) = F(k + 1) + F(k)

 This is a non-parallelizable problem because the


calculation of the Fibonacci sequence as shown
would entail dependent calculations rather than
independent ones. The calculation of the k + 2 value
uses those of both k + 1 and k. These three terms
cannot be calculated independently and therefore,
not in parallel.

Page 13 Introduction to High Performance Computing


Identify the program's hotspots

 Know where most of the real work is being done. The


majority of scientific and technical programs usually
accomplish most of their work in a few places.
 Profilers and performance analysis tools can help
here
 Focus on parallelizing the hotspots and ignore those
sections of the program that account for little CPU
usage.

Page 14 Introduction to High Performance Computing


Identify bottlenecks in the program

 Are there areas that are disproportionately slow, or


cause parallelizable work to halt or be deferred? For
example, I/O is usually something that slows a
program down.
 May be possible to restructure the program or use a
different algorithm to reduce or eliminate
unnecessary slow areas

Page 15 Introduction to High Performance Computing


Other considerations

 Identify inhibitors to parallelism. One common class


of inhibitor is data dependence, as demonstrated by
the Fibonacci sequence above.
 Investigate other algorithms if possible. This may be
the single most important consideration when
designing a parallel application.

Page 16 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 17 Introduction to High Performance Computing


 One of the first steps in designing a parallel program
is to break the problem into discrete "chunks" of work
that can be distributed to multiple tasks. This is
known as decomposition or partitioning.
 There are two basic ways to partition computational
work among parallel tasks:
– domain decomposition
and
– functional decomposition

Page 18 Introduction to High Performance Computing


Domain Decomposition

 In this type of partitioning, the data associated with a


problem is decomposed. Each parallel task then
works on a portion of the data.

Page 19 Introduction to High Performance Computing


Partitioning Data

 There are different ways to partition data

Page 20 Introduction to High Performance Computing


Functional Decomposition
 In this approach, the focus is on the computation that is to be
performed rather than on the data manipulated by the
computation. The problem is decomposed according to the work
that must be done. Each task then performs a portion of the
overall work.
 Functional decomposition lends itself well to problems that can
be split into different tasks. For example
– Ecosystem Modeling
– Signal Processing
– Climate Modeling

Page 21 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 22 Introduction to High Performance Computing


Who Needs Communications?

 The need for communications between tasks depends upon your


problem
 You DON'T need communications
– Some types of problems can be decomposed and executed in parallel with
virtually no need for tasks to share data. For example, imagine an image
processing operation where every pixel in a black and white image needs to
have its color reversed. The image data can easily be distributed to multiple
tasks that then act independently of each other to do their portion of the
work.
– These types of problems are often called embarrassingly parallel because
they are so straight-forward. Very little inter-task communication is required.
 You DO need communications
– Most parallel applications are not quite so simple, and do require tasks to
share data with each other.

Page 23 Introduction to High Performance Computing


Factors to Consider (1)

 There are a number of important factors to consider


when designing your program's inter-task
communications
 Cost of communications
– Inter-task communication virtually always implies overhead.
– Machine cycles and resources that could be used for
computation are instead used to package and transmit data.
– Communications frequently require some type of
synchronization between tasks, which can result in tasks
spending time "waiting" instead of doing work.
– Competing communication traffic can saturate the available
network bandwidth, further aggravating performance
problems.

Page 24 Introduction to High Performance Computing


Factors to Consider (2)

 Latency vs. Bandwidth


– latency is the time it takes to send a minimal (0 byte)
message from point A to point B. Commonly expressed as
microseconds.
– bandwidth is the amount of data that can be communicated
per unit of time. Commonly expressed as megabytes/sec.
– Sending many small messages can cause latency to
dominate communication overheads. Often it is more
efficient to package small messages into a larger message,
thus increasing the effective communications bandwidth.

Page 25 Introduction to High Performance Computing


Factors to Consider (3)

 Visibility of communications
– With the Message Passing Model, communications are
explicit and generally quite visible and under the control of
the programmer.
– With the Data Parallel Model, communications often occur
transparently to the programmer, particularly on distributed
memory architectures. The programmer may not even be
able to know exactly how inter-task communications are
being accomplished.

Page 26 Introduction to High Performance Computing


Factors to Consider (4)

 Synchronous vs. asynchronous communications


– Synchronous communications require some type of "handshaking"
between tasks that are sharing data. This can be explicitly
structured in code by the programmer, or it may happen at a lower
level unknown to the programmer.
– Synchronous communications are often referred to as blocking
communications since other work must wait until the
communications have completed.
– Asynchronous communications allow tasks to transfer data
independently from one another. For example, task 1 can prepare
and send a message to task 2, and then immediately begin doing
other work. When task 2 actually receives the data doesn't matter.
– Asynchronous communications are often referred to as non-
blocking communications since other work can be done while the
communications are taking place.
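
A minimal MPI sketch of non-blocking communication follows (a hedged example, not from the slides; run with at least two tasks). Task 0 posts the send, continues with other work, and only waits when it needs the transfer to be complete.

/* Asynchronous (non-blocking) communication: the sender and receiver post
   their operations, do other work, and complete them later with MPI_Wait. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data = 42;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... other useful work happens here while the message is in flight ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete before reusing 'data' */
    } else if (rank == 1) {
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* ... other work ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("task 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}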

Page 27 Introduction to High Performance Computing


Factors to Consider (5)

 Scope of communications
– Knowing which tasks must communicate with each other is
critical during the design stage of a parallel code. Both of the
two scopings described below can be implemented
synchronously or asynchronously.
– Point-to-point - involves two tasks with one task acting as
the sender/producer of data, and the other acting as the
receiver/consumer.
– Collective - involves data sharing between more than two
tasks, which are often specified as being members in a
common group, or collective.

Page 28 Introduction to High Performance Computing


Collective Communications

 Examples
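
The original slide illustrates collective operations (typically broadcast, scatter, gather and reduction) with diagrams that are not reproduced here. As a stand-in, here is a minimal MPI sketch of two of them, a broadcast and a reduction:

/* Two common collectives: every task in the group participates in each call. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs, value = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) value = 100;

    /* Broadcast: the root sends the same value to every task. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduction: one contribution per task is combined into a result on the root. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast value = %d, sum of ranks = %d (for %d tasks)\n",
               value, sum, nprocs);

    MPI_Finalize();
    return 0;
}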

Page 29 Introduction to High Performance Computing


Factors to Consider (6)

 Efficiency of communications
– Very often, the programmer will have a choice with regard to
factors that can affect communications performance. Only a
few are mentioned here.
– Which implementation for a given model should be used?
Using the Message Passing Model as an example, one MPI
implementation may be faster on a given hardware platform
than another.
– What type of communication operations should be used? As
mentioned previously, asynchronous communication
operations can improve overall program performance.
– Network media - some platforms may offer more than one
network for communications. Which one is best?

Page 30 Introduction to High Performance Computing


Factors to Consider (7)

 Overhead and Complexity

Page 31 Introduction to High Performance Computing


Factors to Consider (8)

 Finally, realize that this is only a partial list of things to consider!!!

Page 32 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 33 Introduction to High Performance Computing


Types of Synchronization

 Barrier
– Usually implies that all tasks are involved
– Each task performs its work until it reaches the barrier. It then stops, or "blocks".
– When the last task reaches the barrier, all tasks are synchronized.
– What happens from here varies. Often, a serial section of work must be done. In other
cases, the tasks are automatically released to continue their work.
 Lock / semaphore
– Can involve any number of tasks
– Typically used to serialize (protect) access to global data or a section of code. Only
one task at a time may use (own) the lock / semaphore / flag.
– The first task to acquire the lock "sets" it. This task can then safely (serially) access the
protected data or code.
– Other tasks can attempt to acquire the lock but must wait until the task that owns the
lock releases it.
– Can be blocking or non-blocking
 Synchronous communication operations
– Involves only those tasks executing a communication operation
– When a task performs a communication operation, some form of coordination is
required with the other task(s) participating in the communication. For example, before
a task can perform a send operation, it must first receive an acknowledgment from the
receiving task that it is OK to send.
– Discussed previously in the Communications section.
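
A minimal C/OpenMP sketch of two of these mechanisms, a lock-like critical section and a barrier, is shown below (a hedged illustration, not taken from the slides):

/* Synchronization sketch: a critical section serializes updates to a shared
   counter; the barrier holds every thread until all of them have arrived. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    int counter = 0;

    #pragma omp parallel
    {
        #pragma omp critical      /* lock / semaphore style: one thread at a time */
        counter++;

        #pragma omp barrier       /* no thread continues until all reach here     */

        #pragma omp single
        printf("all %d threads checked in, counter = %d\n",
               omp_get_num_threads(), counter);
    }
    return 0;
}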

Page 34 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 35 Introduction to High Performance Computing


Definitions

 A dependence exists between program statements


when the order of statement execution affects the
results of the program.
 A data dependence results from multiple use of the
same location(s) in storage by different tasks.
 Dependencies are important to parallel programming
because they are one of the primary inhibitors to
parallelism.

Page 36 Introduction to High Performance Computing


Examples (1): Loop carried data dependence

DO 500 J = MYSTART,MYEND
   A(J) = A(J-1) * 2.0
500 CONTINUE

 The value of A(J-1) must be computed before the


value of A(J), therefore A(J) exhibits a data
dependency on A(J-1). Parallelism is inhibited.
 If Task 2 has A(J) and task 1 has A(J-1), computing
the correct value of A(J) necessitates:
– Distributed memory architecture - task 2 must obtain the
value of A(J-1) from task 1 after task 1 finishes its
computation
– Shared memory architecture - task 2 must read A(J-1) after
task 1 updates it

Page 37 Introduction to High Performance Computing


Examples (2): Loop independent data dependence

task 1 task 2
------ ------
X = 2 X = 4
. .
. .
Y = X**2 Y = X**3

 As with the previous example, parallelism is inhibited. The value


of Y is dependent on:
– Distributed memory architecture - if or when the value of X is
communicated between the tasks.
– Shared memory architecture - which task last stores the value of X.
 Although all data dependencies are important to identify when
designing parallel programs, loop carried dependencies are
particularly important since loops are possibly the most common
target of parallelization efforts.

Page 38 Introduction to High Performance Computing


How to Handle Data Dependencies?

 Distributed memory architectures - communicate


required data at synchronization points.
 Shared memory architectures - synchronize read/write
operations between tasks.

Page 39 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 40 Introduction to High Performance Computing


Definition

 Load balancing refers to the practice of distributing work among


tasks so that all tasks are kept busy all of the time. It can be
considered a minimization of task idle time.
 Load balancing is important to parallel programs for
performance reasons. For example, if all tasks are subject to a
barrier synchronization point, the slowest task will determine the
overall performance.

Page 41 Introduction to High Performance Computing


How to Achieve Load Balance? (1)

 Equally partition the work each task receives


– For array/matrix operations where each task performs similar
work, evenly distribute the data set among the tasks.
– For loop iterations where the work done in each iteration is
similar, evenly distribute the iterations across the tasks.
– If a heterogeneous mix of machines with varying
performance characteristics is being used, be sure to use
some type of performance analysis tool to detect any load
imbalances. Adjust work accordingly.

Page 42 Introduction to High Performance Computing


How to Achieve Load Balance? (2)

 Use dynamic work assignment


– Certain classes of problems result in load imbalances even if data
is evenly distributed among tasks:
 Sparse arrays - some tasks will have actual data to work on while
others have mostly "zeros".
 Adaptive grid methods - some tasks may need to refine their mesh
while others don't.
 N-body simulations - where some particles may migrate to/from their
original task domain to another task's; where the particles owned by
some tasks require more work than those owned by other tasks.
– When the amount of work each task will perform is intentionally
variable, or is unable to be predicted, it may be helpful to use a
scheduler - task pool approach. As each task finishes its work, it
queues to get a new piece of work.
– It may become necessary to design an algorithm which detects and
handles load imbalances as they occur dynamically within the code.

Page 43 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 44 Introduction to High Performance Computing


Definitions

 Computation / Communication Ratio:


– In parallel computing, granularity is a qualitative measure of
the ratio of computation to communication.
– Periods of computation are typically separated from periods
of communication by synchronization events.
 Fine grain parallelism
 Coarse grain parallelism

Page 45 Introduction to High Performance Computing


Fine-grain Parallelism

 Relatively small amounts of computational work


are done between communication events
 Low computation to communication ratio
 Facilitates load balancing
 Implies high communication overhead and less
opportunity for performance enhancement
 If granularity is too fine it is possible that the
overhead required for communications and
synchronization between tasks takes longer
than the computation.

Page 46 Introduction to High Performance Computing


Coarse-grain Parallelism

 Relatively large amounts of


computational work are done between
communication/synchronization events
 High computation to communication
ratio
 Implies more opportunity for
performance increase
 Harder to load balance efficiently

Page 47 Introduction to High Performance Computing


Which is Best?

 The most efficient granularity is dependent on the


algorithm and the hardware environment in which it
runs.
 In most cases the overhead associated with
communications and synchronization is high relative
to execution speed so it is advantageous to have
coarse granularity.
 Fine-grain parallelism can help reduce overheads
due to load imbalance.

Page 48 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 49 Introduction to High Performance Computing


The Bad News

 I/O operations are generally regarded as inhibitors to


parallelism
 Parallel I/O systems are immature or not available for
all platforms
 In an environment where all tasks see the same
filespace, write operations will result in file overwriting
 Read operations will be affected by the fileserver's
ability to handle multiple read requests at the same
time
 I/O that must be conducted over the network (NFS,
non-local) can cause severe bottlenecks

Page 50 Introduction to High Performance Computing


The Good News

 Some parallel file systems are available. For example:


– GPFS: General Parallel File System for AIX (IBM)
– Lustre: for Linux clusters (Cluster File Systems, Inc.)
– PVFS/PVFS2: Parallel Virtual File System for Linux clusters
(Clemson/Argonne/Ohio State/others)
– PanFS: Panasas ActiveScale File System for Linux clusters
(Panasas, Inc.)
– HP SFS: HP StorageWorks Scalable File Share. Lustre based
parallel file system (Global File System for Linux) product from HP
 The parallel I/O programming interface specification for MPI has
been available since 1996 as part of MPI-2. Vendor and "free"
implementations are now commonly available.

Page 51 Introduction to High Performance Computing


Some Options

 If you have access to a parallel file system, investigate using it.


If you don't, keep reading...
 Rule #1: Reduce overall I/O as much as possible
 Confine I/O to specific serial portions of the job, and then use
parallel communications to distribute data to parallel tasks. For
example, Task 1 could read an input file and then communicate
required data to other tasks. Likewise, Task 1 could perform
write operation after receiving required data from all other tasks.
 For distributed memory systems with shared filespace, perform
I/O in local, non-shared filespace. For example, each processor
may have /tmp filespace which can be used. This is usually much
more efficient than performing I/O over the network to one's
home directory.
 Create unique filenames for each task's input/output file(s)

Page 52 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 53 Introduction to High Performance Computing


Amdahl's Law

Amdahl's Law states that potential


program speedup is defined by the
fraction of code (P) that can be
parallelized:
speedup = 1 / (1 - P)

 If none of the code can be parallelized, P = 0 and the


speedup = 1 (no speedup). If all of the code is
parallelized, P = 1 and the speedup is infinite (in
theory).
 If 50% of the code can be parallelized, maximum
speedup = 2, meaning the code will run twice as fast.

Page 54 Introduction to High Performance Computing


Amdahl's Law

 Introducing the number of processors performing the


parallel fraction of work, the relationship can be
modeled by
speedup = 1 / (P/N + S)

 where P = parallel fraction, N = number of processors


and S = serial fraction

Page 55 Introduction to High Performance Computing


Amdahl's Law

 It soon becomes obvious that there are limits to the


scalability of parallelism. For example, at P = .50, .90
and .99 (50%, 90% and 99% of the code is
parallelizable)
                     speedup
        --------------------------------
    N      P = .50    P = .90    P = .99
-----      -------    -------    -------
   10       1.82       5.26       9.17
  100       1.98       9.17      50.25
 1000       1.99       9.91      90.99
10000       1.99       9.91      99.02
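
The entries above follow directly from speedup = 1 / (P/N + S) with S = 1 - P; a small C sketch that reproduces the table:

/* Reproduce the Amdahl's Law speedup table: speedup = 1 / (P/N + S), S = 1 - P. */
#include <stdio.h>

int main(void) {
    const double P[] = {0.50, 0.90, 0.99};
    const int    N[] = {10, 100, 1000, 10000};

    for (int i = 0; i < 4; i++) {
        printf("N = %5d:", N[i]);
        for (int j = 0; j < 3; j++) {
            double S = 1.0 - P[j];
            printf("   P=%.2f -> %6.2f", P[j], 1.0 / (P[j] / N[i] + S));
        }
        printf("\n");
    }
    return 0;
}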

Page 56 Introduction to High Performance Computing


Complexity

 In general, parallel applications are much more complex than


corresponding serial applications, perhaps an order of
magnitude. Not only do you have multiple instruction streams
executing at the same time, but you also have data flowing
between them.
 The costs of complexity are measured in programmer time in
virtually every aspect of the software development cycle:
– Design
– Coding
– Debugging
– Tuning
– Maintenance
 Adhering to "good" software development practices is essential
when working with parallel applications - especially if
somebody besides you will have to work with the software.

Page 57 Introduction to High Performance Computing


Resource Requirements

 The primary intent of parallel programming is to decrease


execution wall clock time, however in order to accomplish this,
more CPU time is required. For example, a parallel code that
runs in 1 hour on 8 processors actually uses 8 hours of CPU
time.
 The amount of memory required can be greater for parallel
codes than serial codes, due to the need to replicate data and
for overheads associated with parallel support libraries and
subsystems.
 For short running parallel programs, there can actually be a
decrease in performance compared to a similar serial
implementation. The overhead costs associated with setting up
the parallel environment, task creation, communications and
task termination can comprise a significant portion of the total
execution time for short runs.

Page 58 Introduction to High Performance Computing


Scalability

 The ability of a parallel program's performance to scale is a


result of a number of interrelated factors. Simply adding more
machines is rarely the answer.
 The algorithm may have inherent limits to scalability. At some
point, adding more resources causes performance to decrease.
Most parallel solutions demonstrate this characteristic at some
point.
 Hardware factors play a significant role in scalability. Examples:
– Memory-CPU bus bandwidth on an SMP machine
– Communications network bandwidth
– Amount of memory available on any given machine or set of
machines
– Processor clock speed
 Parallel support libraries and subsystems software can limit
scalability independent of your application.

Page 59 Introduction to High Performance Computing


Agenda

 Automatic vs. Manual Parallelization


 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning

Page 60 Introduction to High Performance Computing


 As with debugging, monitoring and analyzing parallel
program execution is significantly more of a
challenge than for serial programs.
 A number of parallel tools for execution monitoring
and program analysis are available.
 Some are quite useful; some are cross-platform also.
 One starting point: Performance Analysis Tools
Tutorial
 Work remains to be done, particularly in the area of
scalability.

Page 61 Introduction to High Performance Computing


Parallel Examples
Agenda

 Array Processing
 PI Calculation
 Simple Heat Equation
 1-D Wave Equation

Page 63 Introduction to High Performance Computing


Array Processing

 This example demonstrates calculations on 2-dimensional array


elements, with the computation on each array element being
independent from other array elements.
 The serial program calculates one element at a time in
sequential order.
 Serial code could be of the form:
do j = 1,n
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
 The calculation of elements is independent of one another -
leads to an embarrassingly parallel situation.
 The problem should be computationally intensive.

Page 64 Introduction to High Performance Computing


Array Processing Solution 1

 Array elements are distributed so that each processor owns a portion of an


array (subarray).
 Independent calculation of array elements ensures there is no need for
communication between tasks.
 Distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1)
through the subarrays. Unit stride maximizes cache/memory usage.
 Since it is desirable to have unit stride through the subarrays, the choice of a
distribution scheme depends on the programming language. See the Block -
Cyclic Distributions Diagram for the options.
 After the array is distributed, each task executes the portion of the loop
corresponding to the data it owns. For example, with Fortran block distribution:
do j = mystart, myend
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
 Notice that only the outer loop variables are different from the serial solution.

Page 65 Introduction to High Performance Computing


Array Processing Solution 1
One possible implementation

 Implement as SPMD model.


 Master process initializes array, sends info to worker
processes and receives results.
 Worker process receives info, performs its share of
computation and sends results to master.
 Using the Fortran storage scheme, perform block
distribution of the array.
 Pseudo code solution: in the original slides, red highlights the
changes for parallelism. A sketch is given below.

Page 66 Introduction to High Performance Computing


Array Processing Solution 1
One possible implementation
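
The original slide presents the master/worker pseudo code as an image, which is not reproduced here. In its place, a minimal MPI sketch of the same SPMD block-distribution idea follows. Hedged assumptions: N is divisible by the number of tasks, fcn() is a hypothetical placeholder, and MPI_Gather stands in for the explicit sends/receives of the original pseudo code.

/* SPMD block distribution: each task computes its own block of columns of
   a(n,n); the master (task 0) gathers the blocks into the full array. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024                                  /* assume N % ntasks == 0 */

static double fcn(int i, int j) { return (double)i * j; }   /* placeholder */

int main(int argc, char **argv) {
    int rank, ntasks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    int ncols = N / ntasks;                     /* this task's block of columns */
    double *block = malloc((size_t)ncols * N * sizeof *block);

    for (int j = 0; j < ncols; j++)             /* compute only the owned block */
        for (int i = 0; i < N; i++)
            block[j * N + i] = fcn(i, rank * ncols + j);

    double *a = NULL;
    if (rank == 0) a = malloc((size_t)N * N * sizeof *a);

    /* Master collects every task's block into the full array. */
    MPI_Gather(block, ncols * N, MPI_DOUBLE,
               a,     ncols * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("a(n,n) = %f\n", a[N * N - 1]);

    free(block);
    free(a);
    MPI_Finalize();
    return 0;
}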

Page 67 Introduction to High Performance Computing


Array Processing Solution 2: Pool of Tasks

 The previous array solution demonstrated static load


balancing:
– Each task has a fixed amount of work to do
– May be significant idle time for faster or more lightly loaded
processors - slowest tasks determines overall performance.
 Static load balancing is not usually a major concern if
all tasks are performing the same amount of work on
identical machines.
 If you have a load balance problem (some tasks work
faster than others), you may benefit by using a "pool
of tasks" scheme.

Page 68 Introduction to High Performance Computing


Array Processing Solution 2
Pool of Tasks Scheme

 Two processes are employed


 Master Process:
– Holds pool of tasks for worker processes to do
– Sends worker a task when requested
– Collects results from workers
 Worker Process: repeatedly does the following
– Gets task from master process
– Performs computation
– Sends results to master
 Worker processes do not know before runtime which portion of
array they will handle or how many tasks they will perform.
 Dynamic load balancing occurs at run time: the faster tasks will
get more work to do.
 Pseudo code solution: in the original slides, red highlights the changes for parallelism. A sketch is given below.

Page 69 Introduction to High Performance Computing


Array Processing Solution 2 Pool of Tasks Scheme
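
The original slide again shows the pseudo code as an image. A minimal MPI sketch of the task-pool (master/worker) scheme follows; do_work() is a hypothetical placeholder, and for brevity the master discards results rather than storing them as a real code would.

/* Task pool: the master hands out one work item at a time; each worker sends
   back a result and asks for more until the pool is empty. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

static double do_work(int item) { return item * 2.0; }      /* placeholder */

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                             /* master: owns the pool        */
        int next = 0, active = nprocs - 1;
        double result;
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                                     /* worker: request, compute, return */
        double result = 0.0;                     /* first send is a dummy "ready"    */
        int item;
        for (;;) {
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&item, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = do_work(item);
        }
    }

    MPI_Finalize();
    return 0;
}

Faster workers loop more often and therefore receive more items, which is exactly the dynamic load balancing described above.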

Page 70 Introduction to High Performance Computing


Pi Calculation

 The value of PI can be calculated in a number of


ways. Consider the following method of
approximating PI
– Inscribe a circle in a square
– Randomly generate points in the square
– Determine the number of points in the square that are also in
the circle
– Let r be the number of points in the circle divided by the
number of points in the square
– PI ~ 4 r
 Note that the more points generated, the better the
approximation

Page 71 Introduction to High Performance Computing


Discussion

 In the above pool of tasks example, each task


calculated an individual array element as a job. The
computation to communication ratio is finely granular.
 Finely granular solutions incur more communication
overhead in order to reduce task idle time.
 A more optimal solution might be to distribute more
work with each job. The "right" amount of work is
problem dependent.

Page 72 Introduction to High Performance Computing


Algorithm

npoints = 10000
circle_count = 0
do j = 1,npoints
  generate 2 random numbers between 0 and 1
  xcoordinate = random1
  ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle then
    circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints

 Note that most of the time in running this program


would be spent executing the loop
 Leads to an embarrassingly parallel solution
- Computationally intensive
- Minimal communication
- Minimal I/O
Page 73 Introduction to High Performance Computing
PI Calculation
Parallel Solution

 Parallel strategy: break the loop into


portions that can be executed by the
tasks.
 For the task of approximating PI:
– Each task executes its portion of the
loop a number of times.
– Each task can do its work without
requiring any information from the
other tasks (there are no data
dependencies).
– Uses the SPMD model. One task
acts as master and collects the
results.
 Pseudo code solution: in the original slides, red highlights
the changes for parallelism. A sketch is given below.

Page 74 Introduction to High Performance Computing


PI Calculation
Parallel Solution
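
The original slide shows the parallel pseudo code as an image (not reproduced here). A minimal MPI sketch of the SPMD strategy just described follows: every task samples its portion of the points, and a reduction collects the hit counts on the master.

/* Monte Carlo PI, SPMD style: each task runs its share of the samples;
   MPI_Reduce sums the circle counts on task 0. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NPOINTS 1000000L

int main(int argc, char **argv) {
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand((unsigned)rank + 1);                  /* different random stream per task */
    long my_points = NPOINTS / nprocs;          /* this task's portion of the loop  */
    long my_count = 0;

    for (long j = 0; j < my_points; j++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            my_count++;
    }

    long total = 0;
    MPI_Reduce(&my_count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("PI is approximately %f\n",
               4.0 * total / ((double)my_points * nprocs));

    MPI_Finalize();
    return 0;
}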

Page 75 Introduction to High Performance Computing


This ends this tutorial

Page 76 Introduction to High Performance Computing
