Module -4

Shared Memory Programming with OpenMP

Course Outcome:
At the end of the course, the student will be able to apply OpenMP pragmas and directives to parallelize code to solve a given problem
Topic

Introduction

OpenMP Pragmas and Directives

Trapezoidal Rule

The Reduction Clause

Loop Carried Dependency

Scheduling

Cache Coherence & False Sharing

Tasking

Thread Safety
Introduction

OpenMP is a “directive-based” API for Shared-Memory MIMD Programming

It allows programmers to incrementally parallelize existing serial programs

It allows the programmer to specify blocks of code to be executed in parallel

However, the parallel execution of the code by threads is handled by the compiler and the run-time system

It requires a C compiler that supports OpenMP
Introduction

Shared-memory programs use “Fork / Join” parallelism

At the beginning, a single thread called the “Master Thread” is active and executes the serial portion of the program

When parallel operations are to be executed, it creates (forks) additional threads

At the end of the parallel code, the additional threads created
are destroyed/suspended (joins)
OpenMP Pragmas & Directives

OpenMP supports parallelism through compiler directives called “pragmas”

It always starts with “#pragma omp”

The general structure of the directive is:


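   #pragma omp <directive-name> [clause [clause] ... ]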
Where ‘directive-name’ specifies the action to be taken

‘Clause’ (optional) specifies the behavior of the parallel execution

The directive must end with a newline
OpenMP Pragmas & Directives

“parallel” pragma:
– It precedes a block of code that should be executed in parallel by all threads
– Syntax is:
#pragma omp parallel
– If the block of code to be executed is more than a single statement, it must be surrounded by curly braces ({ })
– The code after the ‘parallel’ pragma will be replicated among threads
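A minimal sketch of the ‘parallel’ pragma (the message printed and the use of omp_get_thread_num are only illustrative):

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
      #pragma omp parallel
      {
         /* this block is replicated: every thread executes it */
         printf("Hello from thread %d\n", omp_get_thread_num());
      }   /* implicit barrier: threads join here */
      return 0;
   }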
OpenMP Pragmas & Directives

“parallel for” pragma:
– It precedes a for loop that should be parallelized
– Syntax is:
#pragma omp parallel for
– The control clause of the for loop must allow the run-time system to determine the number of iterations before the loop executes
– The loop index variable will be ‘private’, while other variables
will be ‘shared’
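A minimal sketch of the ‘parallel for’ pragma, assuming arrays a, b and c of length n are already allocated and initialized:

   #pragma omp parallel for
   for (int i = 0; i < n; i++)     /* i is private; a, b, c and n are shared */
      c[i] = a[i] + b[i];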
OpenMP Pragmas & Directives

“parallel for” pragma: (continued...)
– For loop is parallelized only if following conditions are met:

For loop is in canonical form

‘break’, ‘return’, ‘exit’ and ‘goto’ statements are not allowed, while
‘continue’ is allowed in for loop

The ‘index’ variable must be integer or pointer type

The ‘start’, ‘end’ and ‘incr’ expressions must have compatible types

The ‘start’, ‘end’ and ‘incr’ must not change during execution

The ‘index’ must be changed only in the ‘update’ part of for loop
OpenMP Pragmas & Directives

The canonical form of parallel for loop


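   for (index = start;
        index < end;       /* or <=, >, >= */
        index++)           /* or ++index, index--, --index,
                              index += incr, index -= incr,
                              index = index + incr, index = index - incr */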
OpenMP Pragmas & Directives

“critical” pragma:
– Used to indicate ‘critical section’ of a parallel program
– It immediately precedes the code that should be executed by one thread at a time, i.e., with mutual exclusion
– It must be placed in the parallel section of the code
– Critical section reduces the speedup achieved
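A minimal sketch of the ‘critical’ pragma (assuming omp.h is included; the value assigned to my_result is just a stand-in for each thread’s real work):

   double global_result = 0.0;
   #pragma omp parallel
   {
      double my_result = omp_get_thread_num() + 1.0;   /* stand-in for real work */
      #pragma omp critical
      global_result += my_result;     /* executed by one thread at a time */
   }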
OpenMP Pragmas & Directives

“single” pragma:
– It tells the compiler that only a single thread should execute
the block of code that follows it
– It must be placed in the parallel section of the code
OpenMP Pragmas & Directives

“section” pragma:
– It is used to achieve functional parallelism
– It precedes each function call that is executed in parallel by separate threads
– It must be specified inside the “parallel sections” pragma
OpenMP Pragmas & Directives

“parallel sections” pragma:
– It is used to achieve functional parallelism
– It precedes a block of ‘k’ code blocks that may be executed in parallel by ‘k’ separate threads
– The block of ‘k’ code blocks must be specified using curly
braces ({ })
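A minimal sketch of functional parallelism with ‘parallel sections’; task_a and task_b are hypothetical stand-ins for two independent functions:

   #pragma omp parallel sections
   {
      #pragma omp section
      task_a();     /* executed by one thread */

      #pragma omp section
      task_b();     /* may be executed concurrently by another thread */
   }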
OpenMP Pragmas & Directives

“sections” pragma:
– It is used to achieve functional parallelism
– Functionality is similar to “parallel sections” pragma
– However, it appears inside a ‘parallel’ pragma and
– It doesn’t create any new threads; instead it uses the threads created by the ‘parallel’ pragma
OpenMP Pragmas & Directives

“private” clause:
– Tells the compiler to make one or more variables ‘private’, i.e., each thread gets its own copy
– Syntax: private (<variable list>)
OpenMP Pragmas & Directives

“omp_get_num_procs” function:
– Returns the number of physical processors available for use
by parallel program


“omp_get_num_threads” function:
– Returns the number of active threads in the current parallel
region
OpenMP Pragmas & Directives

“omp_set_num_threads” function:
– Used to set the number of threads to be active in the parallel
sections of code


“omp_get_thread_num” function:
– Every thread on a multiprocessor has a unique identification
number
– Used to retrieve the unique ID of a thread
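A small sketch combining these library functions (assuming stdio.h and omp.h are included):

   omp_set_num_threads(omp_get_num_procs());   /* one thread per processor */

   #pragma omp parallel
   {
      int id  = omp_get_thread_num();     /* unique ID of this thread      */
      int cnt = omp_get_num_threads();    /* active threads in this region */
      printf("Thread %d of %d\n", id, cnt);
   }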
Trapezoidal rule

A classic example of how to develop a parallel program for a problem

Problem: find the area under the curve between the two endpoints ‘a’ and ‘b’, where a < b

Solution:
– Divide the area under the curve between ‘a’ and ‘b’ into ‘n’ subintervals of equal length
– Calculate the area of each subinterval (a trapezoid) and sum them up to get the total area under the curve
Trapezoidal rule

Program link1 Program link2


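A minimal sketch of how such a program might look (the integrand f, the limits a and b, and the number of subintervals n are assumed to be defined; n is assumed to be evenly divisible by the number of threads; each thread integrates its own sub-range and the partial results are combined in a critical section):

   double local_trap(double a, double b, int n) {
      double h            = (b - a) / n;
      int    my_rank      = omp_get_thread_num();
      int    thread_count = omp_get_num_threads();
      int    local_n = n / thread_count;               /* subintervals per thread */
      double local_a = a + my_rank * local_n * h;      /* this thread's sub-range */
      double local_b = local_a + local_n * h;
      double sum = (f(local_a) + f(local_b)) / 2.0;
      for (int i = 1; i < local_n; i++)
         sum += f(local_a + i * h);
      return sum * h;
   }

   /* in main, after a, b and n are set: */
   double global_result = 0.0;
   #pragma omp parallel
   {
      double my_result = local_trap(a, b, n);
      #pragma omp critical
      global_result += my_result;
   }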
Reduction clause

Reductions are operations which use the local results of all threads and combine them into a single result

Reductions are usually written inside a ‘critical’ section, which reduces the speedup achieved
Reduction clause

OpenMP provides a ‘reduction’ clause to specify reductions

Syntax: reduction (<operation>: <variable>)

The reduction clause is specified on the ‘parallel’ or ‘parallel for’ directive
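A minimal sketch of the clause, reusing the local_trap function sketched earlier; the critical section is no longer needed:

   double global_result = 0.0;
   #pragma omp parallel reduction(+: global_result)
   global_result += local_trap(a, b, n);   /* each thread adds its partial sum;
                                              OpenMP combines them safely */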
Loop-Carried Dependency

Data Dependency: a condition in which one computation depends on the result of another

Loop-Carried Dependency: a condition in which a value computed in one iteration is used in subsequent iterations of a for loop

e.g., computing the Fibonacci series, estimating Pi

OpenMP compilers don’t check for data dependencies; the programmer has to

A for loop with a loop-carried dependency cannot be parallelized correctly without restructuring it or using features such as the tasking API, as the sketch below shows
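A minimal sketch of a loop-carried dependency, using the Fibonacci example (the array fibo and the count n are assumed to be declared); iteration i needs the results of iterations i-1 and i-2, so naively adding ‘parallel for’ produces wrong results:

   fibo[0] = fibo[1] = 1;
   /* WRONG if parallelized with '#pragma omp parallel for':         */
   /* each iteration depends on the two previous iterations' results */
   for (int i = 2; i < n; i++)
      fibo[i] = fibo[i - 1] + fibo[i - 2];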
Loop-Carried Dependency

Default clause:
– It requires the programmer to declare the sharing scope of the variables used in a parallel block
– Any variable declared outside but used inside the parallel block must have its scope declared explicitly (see the sketch below)
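A minimal sketch of default(none), assuming an array a of length n; the compiler reports an error if a variable used in the block (such as a or n) is not given an explicit scope:

   double sum = 0.0;
   #pragma omp parallel for default(none) reduction(+: sum) shared(a, n)
   for (int i = 0; i < n; i++)    /* i is the loop index and is private automatically */
      sum += a[i];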
Loop Scheduling

In OpenMP, assigning loop iterations to threads is called scheduling

OpenMP by default uses block partitioning

The ‘schedule’ clause can be used to control the assignment of iterations in a ‘for’ or ‘parallel for’ directive
Loop Scheduling

‘Static’ Scheduling:
– The system assigns chunks of ‘chunksize’ iterations to the threads in round-robin fashion
– Useful when each iteration takes equal amount of time to execute
– If ‘chunksize’ is omitted, it will be equal to: total_iterations / thread_count

Examples: schedule(static, 1), schedule(static, 2), schedule(static, 4)


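A minimal sketch of the clause, assuming a function f whose cost grows with i (so a cyclic distribution balances the load):

   double sum = 0.0;
   #pragma omp parallel for reduction(+: sum) schedule(static, 1)
   for (int i = 0; i < n; i++)
      sum += f(i);   /* iterations 0, t, 2t, ... go to thread 0, where
                        t is the number of threads */

Replacing static with dynamic or guided changes only the schedule clause; the rest of the loop stays the same.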
Loop Scheduling

‘dynamic’ Scheduling:
– The system assigns chunks of ‘chunksize’ iterations to the threads on a first-come, first-served basis
– When a thread finishes its chunk, it requests another one from the run-time system
– Useful when loop iterations do not take uniform amount of time to execute
– If ‘chunksize’ is omitted, it will be equal to 1
Loop Scheduling

‘guided’ Scheduling:
– It is similar to ‘dynamic’ scheduling
– However, as chunks are completed, the size of new chunks decreases
– Useful when loop iterations do not take uniform amount of time to execute
Loop Scheduling

‘runtime’ Scheduling:
– It uses the environment variable ‘OMP_SCHEDULE’ to determine at run time how to schedule the loop
– The environment variable can take any of the values that can be used
for static, dynamic or guided schedule
– e.g., export OMP_SCHEDULE="static,1"
Cache coherence & False Sharing

Cache coherence: when several cores cache copies of the same shared variable, the hardware must keep the copies consistent

Cache line or block: a block of data is transferred from main memory to the cache instead of a single value

When a thread updates a value in a cache line, the entire line is invalidated in the caches of the other cores

This forces the other threads to reload the line from main memory even though they are not actually sharing the updated value

This phenomenon is called ‘false sharing’
Cache coherence & False Sharing

False sharing has a significant effect on performance of a parallel program

e.g., matrix-vector multiplication y = Ax, where A is m × n, x is n × 1 and y is m × 1 (see the sketch below)
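A sketch of the loop being discussed (the usual OpenMP matrix-vector multiplication; A, x, y, m, n, i and j are assumed to be declared and initialized):

   #pragma omp parallel for default(none) private(i, j) shared(A, x, y, m, n)
   for (i = 0; i < m; i++) {
      y[i] = 0.0;                    /* writes to y: the source of write-misses */
      for (j = 0; j < n; j++)
         y[i] += A[i][j] * x[j];     /* reads of A and x: the source of read-misses */
   }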
Cache coherence & False Sharing


The 8,000,000 × 8 input takes 22% more time than the 8000 × 8000 input, due to write-misses on the assignments to y[i]

The 8 × 8,000,000 input takes 26% more time than the 8000 × 8000 input, due to read-misses on the reads of x[j]
Tasking

While and do-while loops cannot be parallelized with the ‘parallel for’ directive

For loops whose number of iterations is not known in advance also cannot be parallelized this way

This limits parallelization: it cannot be applied to recursive algorithms, graph-related algorithms, etc.

Tasking functionality was created to address this issue
Tasking

It allows developers to specify independent units of computation with the ‘task’ directive

Syntax: #pragma omp task

When this directive is reached, a new task will be created

The new task may not necessarily be executed immediately

Tasks must be launched within a ‘parallel’ region, typically by only one thread (e.g., inside a ‘single’ construct)

Hence, tasking generally looks like:

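A minimal sketch of the usual pattern (the task bodies are placeholders):

   #pragma omp parallel
   {
      #pragma omp single     /* only one thread creates the tasks */
      {
         #pragma omp task
         { /* ... one independent unit of work ... */ }

         #pragma omp task
         { /* ... another independent unit of work ... */ }
      }
   }   /* all tasks finish before the parallel region ends */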