Unit V
multicore architecture
Uploaded by poonkods3

UNIT V - PARALLEL PROGRAM DEVELOPMENT

Case studies - n-Body solvers – Tree Search – OpenMP and MPI implementations and comparison.

1. Case studies
Case Study 1- Parallel Sorting Using MPI
Step 1: Choosing Pivots to Define Buckets
 The first step of the algorithm is to select P-1 pivots that define the P buckets. (Bucket i will
contain elements between pivot[i-1] and pivot[i].) To do this, the code randomly selects
S samples from the entire array A and then chooses P-1 pivots from that sample, for example by
sorting the samples and taking every (S/P)'th sample as a pivot.
Step 2: Bucketing Elements of the Input Array
 The second step is to bucket all elements of A into P buckets, where element A[i] is placed in
bucket j if pivot[j-1] <= A[i] < pivot[j]. (The 0'th bucket contains all elements less than pivot[0],
and the (P-1)'th bucket contains all elements greater than or equal to pivot[P-2].) The randomized
choice of pivots ensures that, in expectation, the number of elements in each bucket is well
balanced. (This is important, because it leads to good workload balance in Step 4!)
Step 3: Redistributing Elements
• Now that the bucket containing each array element is known, redistribute the data elements
such that each process i holds all the elements in bucket i.
Step 4: Final Local Sort
• Finally, each process uses a fast sequential sorting algorithm to sort each bucket. As a result, the
distributed array is now sorted!
2. n-Body solvers
The n-body problem
• Find the positions and velocities of a collection of interacting particles over a period of time.
• An n-body solver is a program that finds the solution to an n-body problem by simulating the
behavior of the particles.

[Figure: an n-body solver takes the mass of each particle together with the positions and velocities at time 0 as input, and produces the positions and velocities at time x as output.]
Simulating motion of planets
• Determine the positions and velocities:
– Newton’s second law of motion.
– Newton’s law of universal gravitation.
 Serial pseudo-code

 Computation of the forces

 A Reduced Algorithm for Computing N-Body Forces


Parallelizing the N-Body Solvers
– Apply Foster’s methodology.
– Initially, we want a lot of tasks.
– Start by making our tasks the computations of the positions, the velocities, and the total
forces at each timestep.
Parallelizing the Reduced Solver Using OpenMP

First solution attempt


Second solution attempt

 Here we use one lock for each particle, so updates to different particles' forces no longer serialize one another.


 First Phase Computations for Reduced Algorithm with Block Partition

 First Phase Computations for Reduced Algorithm with Cyclic Partition

Parallelizing the Solvers Using Pthreads


• By default, local variables in Pthreads are private, so all shared variables are global in the
Pthreads version.
• The principal data structures in the Pthreads version are identical to those in the OpenMP
version: vectors are two-dimensional arrays of doubles, and the mass, position, and velocity of a
single particle are stored in a struct. The forces are stored in an array of vectors.
• Startup for Pthreads is basically the same as startup for OpenMP: the main thread gets the
command line arguments and allocates and initializes the principal data structures.
• The main difference between the Pthreads and the OpenMP implementations is in the details of
parallelizing the inner loops.
• Since Pthreads has nothing analogous to a parallel for directive, we must explicitly determine
which values of the loop variables correspond to each thread’s calculations.

• Another difference between the Pthreads and the OpenMP versions has to do with barriers.
• At the end of a parallel for OpenMP has an implied barrier.
• We need to add explicit barriers after the inner loops when a race condition can arise.
• The Pthreads standard includes a barrier (pthread_barrier_t), but it is an optional part of the standard.
• If the implementation doesn't define a barrier, we must write a function that uses a Pthreads
condition variable to implement one.
Parallelizing the Basic Solver Using MPI
• Choices with respect to the data structures:
– Each process stores the entire global array of particle masses.
– Each process only uses a single n-element array for the positions.
– Each process uses a pointer loc_pos that refers to the start of its block of pos.
– So on process 0, loc_pos = pos; on process 1, loc_pos = pos + loc_n; and so on.
– Pseudo-code for the MPI version of the basic n-body solver

3. Tree Search
 A graph (not to be confused with a graph in calculus) is a collection of vertices and edges or line
segments joining pairs of vertices.
 In a directed graph or digraph, the edges are oriented—one end of each edge is the tail, and
the other is the head.
 A graph or digraph is labeled if the vertices and/or edges have labels.
Tree search problems
• Ex., the travelling salesperson problem (TSP): finding a minimum-cost tour.
• TSP is an NP-complete problem.
• There is no known solution to TSP that is better in all cases than exhaustive search.
• A Four-City TSP

 Search Tree for Four-City TSP

Recursive depth-first search


 Using depth-first search we can systematically visit each node of the tree that could
possibly lead to a least-cost solution.
 The simplest formulation of depth-first search uses recursion.
 We want a definite order in which the cities are visited in the for loop in Lines 8 to 13, so we'll
assume that the cities are visited in order of increasing index, from city 1 to city n−1.
 The algorithm makes use of several global variables:
o n: the total number of cities in the problem
o digraph: a data structure representing the input digraph
o hometown: a data structure representing vertex or city 0, the salesperson's hometown
o besttour: a data structure representing the best tour so far

Nonrecursive depth-first search

Parallelizing tree search


 The tasks will communicate down the tree edges: a parent will communicate a new partial tour to a
child, but a child, except for terminating, doesn’t communicate directly with a parent.

 Dynamic mapping of tasks


In a dynamic scheme, if one thread/process runs out of useful work, it can obtain additional work
from another thread/process. In our final implementation of serial depth-first search, each stack
record contains a partial tour.
 A static parallelization of tree search using Pthreads

4. OpenMP and MPI implementations and comparison.


Performance of OpenMP and Pthreads implementations of tree search

Implementation of Tree Search Using MPI and Static Partitioning


 Sending a different number of objects to each process in the communicator

 Gathering a different number of objects from each process in the communicator


 Checking to see if a message is available

 Modes and Buffered Sends


o MPI provides four modes for sends:
o Standard
o Synchronous
o Ready
o Buffered
MPI implementations
 Packing data into a buffer of contiguous memory

 Unpacking data from a buffer of contiguous memory

 Performance of MPI and Pthreads implementations of tree search


o In developing the reduced MPI solution to the n-body problem, the “ring pass”
algorithm proved to be much easier to implement and is probably more scalable.
o In a distributed memory environment in which processes send each other work,
determining when to terminate is a nontrivial problem.
o When deciding which API to use, we should consider whether to use shared- or
distributed-memory.
o We should look at the memory requirements of the application and the amount of
communication among the processes/threads.
o If the memory requirements are great or the distributed memory version can work
mainly with cache, then a distributed memory program is likely to be much faster.
o On the other hand if there is considerable communication, a shared memory program
will probably be faster.
o In choosing between OpenMP and Pthreads, if there’s an existing serial program and it
can be parallelized by the insertion of OpenMP directives, then OpenMP is probably the
clear choice.
o However, if complex thread synchronization is needed then Pthreads will be easier to
use.
