Module 4
Shared Memory Programming with OpenMP
Course Outcome:
At the end of the course, the student will be able to apply OpenMP
pragma and directives to parallelize the code to solve the given
problem
Topic
●
Introduction
●
OpenMP Pragmas and Directives
●
Trapezoidal Rule
●
The Reduction Clause
●
Loop Carried Dependency
●
Scheduling
●
Cache Coherence & False Sharing
●
Tasking
●
Thread Safety
Introduction
●
OpenMP is a “directive-based” API for Shared-Memory MIMD
Programming
●
It allows programmers to incrementally parallelize existing serial programs
●
It allows the programmer to specify the blocks of code to be executed
in parallel
●
However, the parallel execution of the code by threads is taken care of by the
compiler and run-time system
●
It requires a C compiler that supports OpenMP
Introduction
●
Shared-memory programs use “Fork / Join” parallelism
●
At the beginning, a single thread called the “Master Thread” is active and
executes the serial portion of the program
●
When parallel operations are to be executed, it creates (forks)
additional threads
●
At the end of the parallel code, the additional threads created
are destroyed or suspended (join)
OpenMP Pragmas & Directives
●
OpenMP supports parallelism through compiler directives called
“pragmas”
●
It always starts with “#pragma omp”
●
The general structure of the directive is:
#pragma omp <directive-name> [clause [clause ...]]
●
Where ‘directive-name’ specifies the action to be taken
●
‘Clause’ (optional) specifies the behavior of the parallel execution
●
The directive must end with a newline
OpenMP Pragmas & Directives
●
“parallel” pragma:
– It precedes a block of code that should be executed in parallel by all
threads
– Syntax is:
#pragma omp parallel
– If the block of code consists of more than a single statement, it must be
surrounded by curly braces ({ })
– The code after the ‘parallel’ pragma will be replicated among threads
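A minimal sketch of the ‘parallel’ pragma in use (the thread count of 4 and the message are just illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel num_threads(4)
        {   /* this block is replicated: each of the 4 threads executes it */
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }   /* implicit join: the extra threads terminate here */
        return 0;
    }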
OpenMP Pragmas & Directives
●
“parallel for” pragma:
– It precedes a for loop that should be parallelized
– Syntax is:
#pragma omp parallel for
– The control clause of the for loop must allow the run-time system to
determine the number of iterations before the loop executes
– The loop index variable is ‘private’ by default, while other variables
are ‘shared’
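A minimal sketch of the pragma; the array a and its size are just illustrative:

    double a[1000];
    int i, n = 1000;

    #pragma omp parallel for
    for (i = 0; i < n; i++)       /* i is private; a and n are shared */
        a[i] = 2.0 * i;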
OpenMP Pragmas & Directives
●
“parallel for” pragma: (continued...)
– The for loop is parallelized only if the following conditions are met:
●
For loop is in canonical form
●
‘break’, ‘return’, ‘exit’ and ‘goto’ statements are not allowed, while
‘continue’ is allowed in the for loop
●
The ‘index’ variable must be of integer or pointer type
●
The ‘start’, ‘end’ and ‘incr’ must have compatible types
●
The ‘start’, ‘end’ and ‘incr’ must not change during execution
●
The ‘index’ must be changed only in the ‘update’ part of the for loop
OpenMP Pragmas & Directives
The canonical form of parallel for loop
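In outline (following the usual OpenMP definition of a canonical loop):

    for ( index = start ;  index OP end ;  UPDATE )

    where OP is one of        <    <=    >=    >
    and UPDATE is one of      index++   ++index   index--   --index
                              index += incr       index -= incr
                              index = index + incr    index = index - incr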
OpenMP Pragmas & Directives
●
“critical” pragma:
– Used to indicate a ‘critical section’ of a parallel program
– It immediately precedes the code that should be executed by
one thread at a time, i.e., with mutual exclusion
– It must be placed in the parallel section of the code
– A critical section serializes execution and therefore reduces the speedup achieved
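A minimal sketch; do_partial_work() and the variable names are hypothetical:

    double global_result = 0.0;

    #pragma omp parallel
    {
        double my_result = do_partial_work();

        #pragma omp critical
        global_result += my_result;    /* executed by one thread at a time */
    }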
OpenMP Pragmas & Directives
●
“single” pragma:
– It tells the compiler that only a single thread should execute
the block of code that follows it
– It must be placed in the parallel section of the code
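A minimal sketch; do_parallel_work() is hypothetical:

    #pragma omp parallel
    {
        do_parallel_work();                    /* executed by every thread    */

        #pragma omp single
        printf("Printed by one thread\n");     /* the other threads skip this */
                                               /* and wait at the implicit    */
                                               /* barrier that follows        */
    }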
OpenMP Pragmas & Directives
●
“section” pragma:
– It is used to achieve functional parallelism
– It precedes each function call that is executed in parallel by a
separate thread
– It must be specified inside the “parallel sections” pragma
OpenMP Pragmas & Directives
●
“parallel sections” pragma:
– It is used to achieve functional parallelism
– It precedes a block of ‘k’ blocks of code that may be
executed in parallel by ‘k’ separate threads
– The block of ‘k’ code blocks must be specified using curly
braces ({ })
OpenMP Pragmas & Directives
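A minimal sketch of the pragma with two hypothetical functions funcA() and funcB():

    #pragma omp parallel sections
    {
        #pragma omp section
        funcA();                  /* may run on one thread ...     */

        #pragma omp section
        funcB();                  /* ... while this runs on another */
    }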
OpenMP Pragmas & Directives
●
“sections” pragma:
– It is used to achieve functional parallelism
– Functionality is similar to “parallel sections” pragma
– However, it appears inside ‘parallel’ pragma and
– It doesn’t create any new threads; instead it uses the threads
created by the ‘parallel’ pragma
OpenMP Pragmas & Directives
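The same example written with ‘sections’ inside an existing ‘parallel’ region (funcA() and funcB() hypothetical):

    #pragma omp parallel
    {
        #pragma omp sections      /* reuses the threads created by ‘parallel’ */
        {
            #pragma omp section
            funcA();

            #pragma omp section
            funcB();
        }
    }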
OpenMP Pragmas & Directives
●
“private” clause:
– Tells the compiler to make one or more variables ‘private’ (each thread gets its own copy)
– Syntax: private (<variable list>)
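A short sketch of the clause in use (compute(), results[] and n are hypothetical placeholders):

    int i;
    double x;

    #pragma omp parallel for private(x)
    for (i = 0; i < n; i++) {
        x = compute(i);     /* each thread works on its own (uninitialized) copy of x */
        results[i] = x;
    }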
OpenMP Pragmas & Directives
●
“omp_get_num_procs” function:
– Returns the number of physical processors available for use
by parallel program
●
“omp_get_num_threads” function:
– Returns the number of active threads in the current parallel
region
OpenMP Pragmas & Directives
●
“omp_set_num_threads” function:
– Used to set the number of threads to be active in the parallel
sections of code
●
“omp_get_thread_num” function:
– Every thread on a multiprocessor has a unique identification
number
– Used to retrieve the unique ID of a thread
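A short sketch combining these four functions (the printf output is just illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int p = omp_get_num_procs();    /* physical processors available     */
        omp_set_num_threads(p);         /* request one thread per processor  */

        #pragma omp parallel
        {
            int id    = omp_get_thread_num();   /* unique ID of this thread   */
            int count = omp_get_num_threads();  /* threads in this region     */
            printf("Thread %d of %d (machine has %d processors)\n", id, count, p);
        }
        return 0;
    }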
Trapezoidal rule
●
A classic example of how to create a parallel program for a given problem
●
Problem: find the area under a curve between the two limits ‘a’
and ‘b’, where a < b
●
Solution:
– Divide the interval between ‘a’ and ‘b’ into ‘n’
subintervals of equal length
– Approximate the area over each subinterval by a trapezoid and sum these
areas to get the total area under the curve
Trapezoidal rule
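A sketch of the parallel trapezoidal rule using the ‘critical’ pragma; f(), a, b and n are assumed to be the function and inputs from the problem statement:

    #include <omp.h>

    double trap(double a, double b, int n, double (*f)(double)) {
        double h = (b - a) / n;            /* width of each subinterval        */
        double total = 0.0;

        #pragma omp parallel
        {
            double x, my_sum = 0.0;
            int i;

            /* each thread sums the interior points of its share of iterations */
            #pragma omp for
            for (i = 1; i <= n - 1; i++) {
                x = a + i * h;
                my_sum += f(x);
            }

            #pragma omp critical
            total += my_sum;               /* combine local results safely     */
        }

        total += (f(a) + f(b)) / 2.0;      /* endpoints counted with weight 1/2 */
        return total * h;
    }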
Reduction clause
●
Reductions are operations that combine the local results of
all threads into a single result
●
Without the reduction clause, reductions are usually written inside a ‘critical section’,
which reduces the speedup
Reduction clause
●
OpenMP provides a ‘reduction’ clause to specify reductions
●
Syntax: reduction(<operator>: <variable list>)
●
The reduction clause is specified on the ‘parallel’ or ‘parallel for’ directive
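The same kind of sum written with the clause (compute() and n are hypothetical placeholders):

    double sum = 0.0;
    int i;

    #pragma omp parallel for reduction(+: sum)
    for (i = 0; i < n; i++)
        sum += compute(i);   /* each thread accumulates a private copy of sum, */
                             /* combined with ‘+’ at the end of the loop       */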
Loop-Carried Dependency
●
Data dependency: a condition where one computation depends
on the result of another computation
●
Loop-carried dependency: a condition where data computed in one
iteration of a for loop is used in subsequent iterations
●
e.g., computing the Fibonacci series, estimating Pi
●
OpenMP compilers don’t check for data dependencies; the programmer
has to do it
●
A for loop with a loop-carried dependency cannot be parallelized correctly
without using features such as the tasking API
Loop-Carried Dependency
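A sketch of the Fibonacci example: iteration i reads values produced by iterations i-1 and i-2, so the ‘parallel for’ shown here (deliberately) gives incorrect results:

    long fibo[30];
    int i;

    fibo[0] = fibo[1] = 1;

    /* WRONG: loop-carried dependency on fibo[i-1] and fibo[i-2] */
    #pragma omp parallel for
    for (i = 2; i < 30; i++)
        fibo[i] = fibo[i - 1] + fibo[i - 2];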
Loop-Carried Dependency
●
Default clause:
– It makes the programmer declare the scope of every variable used in a parallel block
– Any variable declared outside but used inside the parallel block must have its
scope declared explicitly
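A short sketch with default(none): every variable used in the block must now be scoped explicitly (f() and the variable names are illustrative):

    double sum = 0.0;
    int i, n = 1000;

    #pragma omp parallel for default(none) reduction(+: sum) shared(n)
    for (i = 0; i < n; i++)
        sum += f(i);     /* forgetting to scope a variable is now a compile-time error */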
Loop Scheduling
●
In OpenMP, assigning loop iterations to threads is called scheduling
●
OpenMP by default uses block partitioning
●
The ‘schedule’ clause can be used to control how iterations are assigned in a
‘for’ or ‘parallel for’ directive
Loop Scheduling
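The clause has the general form (chunksize is optional):

    schedule(<type> [, <chunksize>])

where <type> is static, dynamic, guided or runtime (described on the following slides)
and <chunksize> is a positive integer.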
Loop Scheduling
●
‘Static’ Scheduling:
– System assigns ‘chunksize’ iterations to each thread in ‘round-robin’ fashion
– Useful when each iteration takes equal amount of time to execute
– If ‘chunksize’ is omitted, it will be equal to: total_iterations / thread_count
Examples: schedule(static, 1), schedule(static, 2), schedule(static, 4)
Loop Scheduling
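A worked illustration, assuming 12 iterations and 3 threads (work() is a hypothetical function):

    /* schedule(static, 2): chunks of 2 iterations, dealt out round-robin */
    /*   thread 0: iterations 0 1   6 7                                   */
    /*   thread 1: iterations 2 3   8 9                                   */
    /*   thread 2: iterations 4 5  10 11                                  */
    #pragma omp parallel for schedule(static, 2) num_threads(3)
    for (i = 0; i < 12; i++)
        work(i);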
Loop Scheduling
●
‘dynamic’ Scheduling:
– System assigns ‘chunksize’ iterations to each thread in a ‘first-come first-served’ fashion
– When a thread finishes its chunk, it requests another one from the run-time system
– Useful when loop iterations do not take uniform amount of time to execute
– If ‘chunksize’ is omitted, it will be equal to 1
Loop Scheduling
●
‘guided’ Scheduling:
– It is similar to ‘dynamic’ scheduling
– However, as chunks are completed, the size of new chunks decreases
– Useful when loop iterations do not take uniform amount of time to execute
Loop Scheduling
Loop Scheduling
●
‘runtime’ Scheduling:
– It uses the environment variable ‘OMP_SCHEDULE’ to determine at run time
how to schedule the loop
– The environment variable can take any of the values that can be used
for static, dynamic or guided schedule
– e.g., export OMP_SCHEDULE="static,1"
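A short sketch of schedule(runtime); the schedule is then taken from OMP_SCHEDULE when the program runs (work() and n are hypothetical):

    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        work(i);          /* actual schedule read from OMP_SCHEDULE */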
Cache coherence & False Sharing
●
Cache coherence: keeping the caches of different cores consistent when
they hold copies of the same shared variables
●
Cache line (or block): a block of contiguous memory is transferred from main
memory to the cache instead of a single value
●
When a thread updates a value in its cache, the entire cache line is
invalidated in the caches of the other cores
●
This forces the other threads to re-read the line from memory even though
they are not actually sharing the updated value
●
This phenomenon is called ‘false sharing’
Cache coherence & False Sharing
●
False sharing has a significant effect on performance of a parallel program
●
e.g., matrix-vector multiplication y = Ax, where A is m × n, x is n × 1 and y is m × 1
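A sketch of the parallelized product (row-major storage of A in a one-dimensional array is an assumption of this sketch):

    int i, j;

    #pragma omp parallel for default(none) private(j) shared(A, x, y, m, n)
    for (i = 0; i < m; i++) {
        y[i] = 0.0;                       /* write into the shared result vector */
        for (j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];    /* reads of A and x, updates of y[i]   */
    }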
Cache coherence & False Sharing
●
The 8,000,000 × 8 case takes 22% more time than the 8000 × 8000 case due to write-misses at Line 4
of the code
●
The 8 × 8,000,000 case takes 26% more time than the 8000 × 8000 case due to read-misses at Line 6
of the code
Tasking
●
while and do-while loops cannot be parallelized with the ‘parallel for’ directive
●
for loops in which the number of iterations is not known in advance also cannot
be parallelized this way
●
This limits parallelization: it cannot be applied to recursive
algorithms, graph-related algorithms, etc.
●
Tasking functionality was created to address this issue
Tasking
●
It allows developers to specify independent units of computation with the
‘task’ directive
●
Syntax: #pragma omp task
●
When this directive is reached, a new task will be created
●
The new task may not necessarily be executed immediately
●
Tasks must be launched within a ‘parallel’ region by only one thread
●
Hence, tasking generally looks like:
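In outline (node_t, head and process() are hypothetical; the pattern itself is the usual one thread generating tasks inside a ‘single’ block):

    #pragma omp parallel
    {
        #pragma omp single            /* only one thread generates the tasks    */
        {
            node_t *p = head;         /* hypothetical linked list to traverse   */
            while (p != NULL) {
                #pragma omp task firstprivate(p)
                process(p);           /* each call becomes an independent task  */
                p = p->next;
            }
        }
    }   /* all tasks have completed by the end of the parallel region */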