Unit V
multicore architecture
Uploaded by poonkods3

UNIT V - PARALLEL PROGRAM DEVELOPMENT

Case studies - n-Body solvers – Tree Search – OpenMP and MPI implementations and comparison.

1. Case studies
Case Study 1- Parallel Sorting Using MPI
Step 1: Choosing Pivots to Define Buckets
 The first step of the algorithm is to select P-1 pivots that define the P buckets. (Bucket i will
contain elements between pivot[i-1] and pivot[i].) To do this, the code randomly selects
S samples from the entire array A and then chooses P-1 pivots from that sample, for example by
sorting the samples and taking every (S/P)'th sample as a pivot.
Step 2: Bucketing Elements of the Input Array
 The second step is to bucket all elements of A into P buckets, where element A[i] is placed in
bucket j if pivot[j-1] <= A[i] < pivot[j]. (The 0'th bucket contains all elements less than pivot[0],
and the (P-1)'th bucket contains all elements greater than or equal to pivot[P-2].) The randomized
choice of pivots ensures that, in expectation, the number of elements in each bucket is well
balanced. (This is important, because it leads to good workload balance in Step 4!)
Step 3: Redistributing Elements
• Now that the bucket containing each array element is known, redistribute the data elements
such that each process i holds all the elements in bucket i.
Step 4: Final Local Sort
• Finally, each process uses a fast sequential sorting algorithm to sort each bucket. As a result, the
distributed array is now sorted!
2. n-Body solvers
The n-body problem
• Find the positions and velocities of a collection of interacting particles over a period of time.
• An n-body solver is a program that finds the solution to an n-body problem by simulating the
behavior of the particles.

[Figure: an n-body solver takes the mass of each particle together with the positions and velocities at time 0 as input, and produces the positions and velocities at time x as output.]
Simulating motion of planets
• Determine the positions and velocities:
– Newton’s second law of motion.
– Newton’s law of universal gravitation.
 Serial pseudo-code

 Computation of the forces

 A Reduced Algorithm for Computing N-Body Forces


Parallelizing the N-Body Solvers
– Apply Foster’s methodology.
– Initially, we want a lot of tasks.
– Start by making our tasks the computations of the positions, the velocities, and the total
forces at each timestep.
Parallelizing the Reduced Solver Using OpenMP

First solution attempt


Second solution attempt

 Here we use one lock for each particle, so updates to different particles' forces no longer serialize one another.


 First Phase Computations for Reduced Algorithm with Block Partition

 First Phase Computations for Reduced Algorithm with Cyclic Partition

Parallelizing the Solvers Using Pthreads


• By default, local variables in Pthreads are private, so all shared variables are global in the
Pthreads version.
• The principal data structures in the Pthreads version are identical to those in the OpenMP
version: vectors are two-dimensional arrays of doubles, and the mass, position, and velocity of a
single particle are stored in a struct. The forces are stored in an array of vectors.
• Startup for Pthreads is basically the same as startup for OpenMP: the main thread gets the
command line arguments and allocates and initializes the principal data structures.
• The main difference between the Pthreads and the OpenMP implementations is in the details of
parallelizing the inner loops.
• Since Pthreads has nothing analogous to a parallel for directive, we must explicitly determine
which values of the loop variables correspond to each thread’s calculations.

• Another difference between the Pthreads and the OpenMP versions has to do with barriers.
• At the end of a parallel for OpenMP has an implied barrier.
• We need to add explicit barriers after the inner loops when a race condition can arise.
• The Pthreads standard includes a barrier (pthread_barrier_t), but it is an optional part of the standard.
• If the implementation doesn't define a barrier, we must write a function that uses a Pthreads
condition variable to implement one.
Parallelizing the Basic Solver Using MPI
• Choices with respect to the data structures:
– Each process stores the entire global array of particle masses.
– Each process only uses a single n-element array for the positions.
– Each process uses a pointer loc_pos that refers to the start of its block of pos.
– So on process 0, loc_pos = pos; on process 1, loc_pos = pos + loc_n; and so on.
– Pseudo-code for the MPI version of the basic n-body solver

3. Tree Search
 A graph (not to be confused with a graph in calculus) is a collection of vertices and edges or line
segments joining pairs of vertices.
 In a directed graph or digraph, the edges are oriented—one end of each edge is the tail, and
the other is the head.
 A graph or digraph is labeled if the vertices and/or edges have labels.
Tree search problems
• Ex., the travelling salesperson problem (TSP): finding a minimum-cost tour.
• TSP is an NP-complete problem.
• There is no known solution to TSP that is better in all cases than exhaustive search.
• A Four-City TSP

 Search Tree for Four-City TSP

Recursive depth-first search


 Using depth-first search we can systematically visit each node of the tree that could
possibly lead to a least-cost solution.
 The simplest formulation of depth-first search uses recursion.
 We want a definite order in which the cities are visited in the for loop in Lines 8 to 13, so we'll
assume that the cities are visited in order of increasing index, from city 1 to city n−1.
 The algorithm makes use of several global variables:
o n: the total number of cities in the problem
o digraph: a data structure representing the input digraph
o hometown: a data structure representing vertex or city 0, the salesperson's hometown
o besttour: a data structure representing the best tour so far

Nonrecursive depth-first search

Parallelizing tree search


 The tasks will communicate down the tree edges: a parent will communicate a new partial tour to a
child, but a child, except for terminating, doesn’t communicate directly with a parent.

 Dynamic mapping of tasks


In a dynamic scheme, if one thread/process runs out of useful work, it can obtain additional work
from another thread/process. In our final implementation of serial depth-first search, each stack
record contains a partial tour.
 A static parallelization of tree search using Pthreads

4. OpenMP and MPI implementations and comparison.


Performance of OpenMP and Pthreads implementations of tree search

Implementation of Tree Search Using MPI and Static Partitioning


 Sending a different number of objects to each process in the communicator

 Gathering a different number of objects from each process in the communicator


 Checking to see if a message is available

 Modes and Buffered Sends


o MPI provides four modes for sends:
o Standard
o Synchronous
o Ready
o Buffered
MPI implementations
 Packing data into a buffer of contiguous memory

 Unpacking data from a buffer of contiguous memory

 Performance of MPI and Pthreads implementations of tree search


o In developing the reduced MPI solution to the n-body problem, the “ring pass”
algorithm proved to be much easier to implement and is probably more scalable.
o In a distributed memory environment in which processes send each other work,
determining when to terminate is a nontrivial problem.
o When deciding which API to use, we should consider whether to use shared- or
distributed-memory.
o We should look at the memory requirements of the application and the amount of
communication among the processes/threads.
o If the memory requirements are great or the distributed memory version can work
mainly with cache, then a distributed memory program is likely to be much faster.
o On the other hand if there is considerable communication, a shared memory program
will probably be faster.
o In choosing between OpenMP and Pthreads, if there’s an existing serial program and it
can be parallelized by the insertion of OpenMP directives, then OpenMP is probably the
clear choice.
o However, if complex thread synchronization is needed then Pthreads will be easier to
use.
