
Multithreaded Algorithms

Motivation

Serial algorithms are suitable for running on a uniprocessor computer. We will now extend our model to parallel algorithms that can run on a multiprocessor computer.
Computational Model

There exist many competing models of parallel computation that are essentially different. For example, one can have shared or distributed memory. Since multicore processors are ubiquitous, we focus on a parallel computing model with shared memory.
Threads

• A thread in computer science is short for a thread of execution.
• A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system.
• Threads are a way for a program to divide (termed "split") itself into two or more simultaneously (or pseudo-simultaneously) running tasks.
• The implementation of threads and processes differs between operating systems, but in most cases a thread is a component of a process. Multiple threads can exist within one process, executing concurrently and sharing resources such as memory, while different processes do not share these resources.
Threading Types

Two types are feasible:
• Static threading: the OS controls the threads; typically used on single-core CPUs, though multi-core CPUs can use it if the compiler guarantees safe execution.
• Dynamic threading: the program controls the threads explicitly; threads are created and destroyed as needed. This is the parallel computing model we use.

Threads allow concurrent execution of two or more parts of a program for maximum utilization of the CPU.
Dynamic Multithreading

Programming a shared-memory parallel computer can be difficult and error-prone. In particular, it is difficult to partition the work among several threads so that each thread has approximately the same load.

A concurrency platform is a software layer that coordinates, schedules, and manages parallel-computing resources. We will use a simple extension of the serial programming model that uses the concurrency instructions parallel, spawn, and sync.
Spawn

Spawn: If spawn precedes a procedure call, then the procedure instance that executes the spawn (the parent) may continue to execute in parallel with the spawned subroutine (the child), instead of waiting for the child to complete.

The keyword spawn does not say that a procedure must execute concurrently, but simply that it may. At runtime, it is up to the scheduler to decide which subcomputations should run concurrently.
Sync

The keyword sync indicates that the procedure must wait for all its spawned children to complete.
Parallel

Many algorithms contain loops where all iterations can operate in parallel. If the parallel keyword precedes a for loop, then this indicates that the loop body can be executed in parallel.
OpenMP – uses a task scheduler to create threads

#include <stdio.h>
#include <omp.h>

int main() {
    int i, n = 5;
    int array[5] = {0, 1, 4, 9, 16};
    #pragma omp parallel for            // i is implicitly private to each thread
    for (i = 0; i < n; i++) {
        printf("Thread %d processes index %d\n", omp_get_thread_num(), i);
        array[i] = array[i] * 2;        // example operation: double each element
        printf("array[%d] = %d\n", i, array[i]);
    }
    return 0;
}
Fibonacci Numbers

Definition

The Fibonacci numbers (0, 1, 1, 2, 3, 5, 8, 13, …) are defined by the recurrence:

    F(0) = 0
    F(1) = 1
    F(i) = F(i-1) + F(i-2)   for i > 1.
Naive Algorithm

Computing the Fibonacci numbers can be done with the following algorithm:

Fibonacci(n)
    if n < 2 then return n
    x = Fibonacci(n-1)
    y = Fibonacci(n-2)
    return x + y
Running Time

Let T(n) denote the running time of Fibonacci(n). Since this procedure contains two recursive calls and a constant amount of extra work, we get

    T(n) = T(n-1) + T(n-2) + θ(1)

which yields T(n) = θ(F(n)) = θ( ((1+sqrt(5))/2)^n ).

Since this grows exponentially in n, this is a particularly bad way to calculate Fibonacci numbers.

How would you calculate the Fibonacci numbers?
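For comparison (a minimal sketch, not from the original slides), a simple bottom-up loop computes F(n) in θ(n) time and constant space:

#include <stdio.h>

// Bottom-up Fibonacci: θ(n) time, O(1) space.
long long fib_iter(int n) {
    if (n < 2) return n;
    long long prev = 0, curr = 1;   // F(0), F(1)
    for (int i = 2; i <= n; i++) {
        long long next = prev + curr;
        prev = curr;
        curr = next;
    }
    return curr;
}

int main(void) {
    printf("F(10) = %lld\n", fib_iter(10));   // prints F(10) = 55
    return 0;
}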
Fibonacci Example

Observe that within FIB(n), the two recursive calls in lines 3 and 4, to FIB(n-1) and FIB(n-2) respectively, are independent of each other: they could be called in either order, and the computation performed by one in no way affects the other. Therefore, the two recursive calls can run in parallel.
Fibonacci Example

Parallel algorithm to compute Fibonacci numbers: we augment our pseudocode to indicate parallelism by adding the concurrency keywords spawn and sync. Here is how we can rewrite the FIB procedure to use dynamic multithreading:
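The P-FIB pseudocode itself appeared as an image on the original slide; the reconstruction below follows the serial FIB procedure above, with line numbers included because the discussion on the next slides refers to them:

P-FIB(n)
1   if n < 2
2       return n
3   x = spawn P-FIB(n-1)
4   y = P-FIB(n-2)
5   sync
6   return x + y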
Spawn, Sync & Parallel

Notice that if we delete the concurrency keywords spawn and sync from P-FIB, the resulting pseudocode text is identical to FIB (other than renaming the procedure in the header and in the two recursive calls).

We define the serialization of a multithreaded algorithm to be the serial algorithm that results from deleting the multithreaded keywords: spawn, sync, and parallel.
Spawn

Nested parallelism occurs when the keyword spawn precedes a procedure call, as in line 3. It creates a concurrent process.

The semantics of a spawn differs from an ordinary procedure call in that the procedure instance that executes the spawn (the parent) may continue to execute in parallel with the spawned subroutine (its child), instead of waiting for the child to complete, as would normally happen in a serial execution.
Spawn

In this case, while the spawned child is computing P-FIB(n-1), the parent may go on to compute P-FIB(n-2) in line 4 in parallel with the spawned child.

Since the P-FIB procedure is recursive, these two subroutine calls themselves create nested parallelism, as do their children, thereby creating a potentially vast tree of subcomputations, all executing in parallel.
Spawn

The keyword spawn does not say, however, that a procedure must execute concurrently with its spawned children, only that it may. The concurrency keywords express the logical parallelism of the computation, indicating which parts of the computation may proceed in parallel. At runtime, it is up to a scheduler to determine which subcomputations actually run concurrently by assigning them to available processors as the computation unfolds.
Sync

A procedure cannot safely use the values returned by its spawned children until after it executes a sync statement, as in line 5. The keyword sync indicates that the procedure must wait as necessary for all its spawned children to complete execution before proceeding to the statement after the sync.

In the P-FIB procedure, a sync is required before the return statement in line 6 to avoid the anomaly that would occur if x and y were summed before x was computed.

In addition to the explicit synchronization provided by the sync statement, every procedure executes a sync implicitly before it returns, thus ensuring that all its children terminate before it does.
A Model for Multithreaded Execution

It helps to think of a multithreaded computation (the set of runtime instructions executed by a processor on behalf of a multithreaded program) as a directed acyclic graph G = (V, E), called a computation dag.
Computation DAG

Multithreaded computation can be better understood with the help of a computation directed acyclic graph G = (V, E). The vertices V in the graph are the instructions. The edges E represent dependencies between instructions: an edge (u, v) in E means that instruction u must execute before instruction v.
Strands and Threads

A sequence of instructions containing no parallel control (spawn, sync, return from a spawn, parallel) can be grouped into a single strand.

Thus, V represents a set of strands and E the dependencies between the strands introduced by parallel control.

A strand of maximal length will be called a thread.
Computation DAG

A computation directed acyclic graph G = (V, E) consists of a vertex set V that comprises the threads of the program. The edge set E contains an edge (u, v) if and only if thread u needs to execute before thread v.

If there is an edge between threads u and v, then they are said to be (logically) in series. If there is no edge, then they are said to be (logically) in parallel.
Edge Classification

• A continuation edge (u, v) connects a thread u to its successor v within the same procedure instance.
• When a thread u spawns a new thread v, then (u, v) is called a spawn edge.
• When a thread v returns to its calling procedure and x is the thread following the parallel control, then the return edge (v, x) is included in the graph.
Fibonacci Example

Parallel algorithm to compute Fibonacci numbers:

Fibonacci(4)

• Each circle represents one strand (a chain of instructions that contains no parallel control).
• Black dots: the base case, or the part of the procedure up to the spawn of P-FIB(n-1) in line 3.
• Grey dots: regular execution, i.e., the part of the procedure that calls P-FIB(n-2) in line 4, up to the sync in line 5.
• White dots: the part of the procedure after the sync, up to the point where it returns the result.
Performance Measures

DAG: directed acyclic graph. Vertices are the circles for spawn, sync, or procedure call. For a problem of size n:

• Span, S or T∞(n): the number of vertices on the longest directed path from start to finish in the computation DAG (the critical path). This is the running time if each vertex of the DAG has its own processor.
• Work, W or T1(n): the total time to execute the entire computation on one processor, defined as the number of vertices in the computation DAG.
• Tp(n): the total time to execute the entire computation with p processors.
• Speedup = T1/Tp: how much faster the computation is.
• Parallelism = T1/T∞: the maximum possible speedup.
Performance Measures

The work of a multithreaded computation is the total time to execute the entire computation on one processor.
    Work = sum of the times taken by each thread = 17 time units

The span is the longest time to execute the strands along any path of the computation DAG.
    Span = the number of vertices on a longest (critical) path = 8 time units
Performance Measure Example

In Fibonacci(4), we have 17 vertices = 17 threads, and 8 vertices on the longest path. Assuming unit time for each thread, we get
    work = 17 time units
    span = 8 time units

The actual running time of a multithreaded computation depends not just on its work and span, but also on how many processors (cores) are available, and on how the scheduler allocates strands to processors. The running time on P processors is indicated by the subscript P:
- T1: running time on a single processor
- TP: running time on P processors
- T∞: running time on an unlimited number of processors, also called the span
Work Law

An ideal parallel computer with P processors can do at most P units of work per time step, and thus in time Tp it can perform at most P·Tp work. Since the total work is T1, we have

    P·Tp >= T1

Dividing by P yields the work law:

    Tp >= T1/P
Span Law

A P-processor ideal parallel computer cannot run faster than a machine with an unlimited number of processors. However, a computer with an unlimited number of processors can emulate a P-processor machine by using just P of its processors. Therefore,

    Tp >= T∞

(with P processors, the execution time is greater than or equal to the execution time with an unlimited number of processors), which is called the span law.

- TP: running time on P processors
- T∞: running time on an unlimited number of processors

Span Law Explanation: why is this true?
• Unlimited processors: with unlimited processors you could ideally execute all operations in parallel. The only limitation is the critical path length (the longest sequence of dependent operations). Hence, the time to finish the algorithm in the best case is T∞.
• P processors: on a machine with only P processors, you cannot exploit the parallelism as fully, because there are fewer processors to distribute the work over. Some operations will inevitably have to wait for others, so the execution time will generally be greater than or equal to the time it would take with an unlimited number of processors.
Speedup and Parallelism

The speedup of a computation on P processors is defined as T1/Tp, i.e., how many times faster the computation runs on P processors than on 1 processor.

Since P·Tp >= T1 by the work law, the speedup on P processors can be at most P.
Speedup and Parallelism

The parallelism (maximum possible speedup) of a multithreaded computation is given by T1/T∞. We can view the parallelism from three perspectives:

• As a ratio, the parallelism denotes the average amount of work that can be performed in parallel for each step along the critical path.
• As an upper bound, the parallelism gives the maximum possible speedup that can be achieved on any number of processors.
• Finally, and perhaps most important, the parallelism provides a limit on the possibility of attaining perfect linear speedup. Specifically, once the number of processors exceeds the parallelism, the computation cannot possibly achieve perfect linear speedup.
Speedup and Parallelism

Consider the computation P-FIB(4) and assume that each strand takes unit time. Since the work is T1 = 17 and the span is T∞ = 8, the parallelism is T1/T∞ = 17/8 = 2.125.

Consequently, achieving much more than double the speedup is impossible, no matter how many processors we employ to execute the computation.
Scheduling

The performance depends not just on the work and span. Additionally, the strands must be scheduled efficiently onto the processors of the parallel machine. The strands must be mapped to static threads, and the operating system schedules the threads on the processors themselves. The scheduler must schedule the computation with no advance knowledge of when the strands will be spawned or when they will complete; it must operate online.
Greedy Scheduler

We will assume a greedy scheduler in our analysis, since this keeps things simple. A greedy scheduler assigns as many strands to processors as possible in each time step. On P processors, if at least P strands are ready to execute during a time step, then we say that the step is a complete step; otherwise we say that it is an incomplete step.
Greedy Scheduler Theorem

On an ideal parallel computer with P processors, a greedy scheduler executes a multithreaded computation with work T1 and span T∞ in time

    TP <= T1/P + T∞

(Given that the best we can hope for on P processors is TP = T1/P by the work law and TP = T∞ by the span law, this bound is the sum of those two lower bounds.)
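As a quick check with the P-FIB(4) numbers from earlier (work T1 = 17, span T∞ = 8): on P = 2 processors a greedy scheduler needs at most TP <= 17/2 + 8 = 16.5 time units, while the work and span laws alone only guarantee TP >= max(17/2, 8) = 8.5.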
Slackness

The parallel slackness of a multithreaded computation executed on an ideal parallel computer with P processors is the ratio of the parallelism to P:

    Slackness = (T1/T∞) / P

Informally, slackness in the context of parallel computing and scheduling refers to the amount of unused or idle time that a processor has while waiting for tasks to be assigned, i.e., the difference between the time available and the time actually required to complete a task.
Speedup

Let TP be the running time of a multithreaded computation produced by a greedy scheduler on an ideal computer with P processors. Let T1 be the work and T∞ the span of the computation. If the slackness is large, i.e., P << T1/T∞, then TP is approximately T1/P.
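For example (hypothetical numbers, not from the original slides): if T1 = 1,000,000 and T∞ = 1,000, the parallelism is 1,000. On P = 10 processors the slackness is 100, and the greedy-scheduler bound gives TP <= 1,000,000/10 + 1,000 = 101,000, within 1% of the ideal T1/P = 100,000.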
Back to Fibonacci

Parallel Fibonacci Computation

Parallel algorithm to compute Fibonacci numbers:

Fibonacci(n)
    if n < 2 then return n
    x = spawn Fibonacci(n-1)    // parallel execution
    y = spawn Fibonacci(n-2)    // parallel execution
    sync                        // wait for the results of x and y
    return x + y
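To make the spawn/sync pattern concrete, here is a rough OpenMP sketch (not from the original slides) in which spawn roughly corresponds to #pragma omp task and sync to #pragma omp taskwait:

#include <stdio.h>
#include <omp.h>

// Sketch of the parallel Fibonacci above using OpenMP tasks.
long pfib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)      // "spawn" Fibonacci(n-1)
    x = pfib(n - 1);
    #pragma omp task shared(y)      // "spawn" Fibonacci(n-2)
    y = pfib(n - 2);
    #pragma omp taskwait            // "sync": wait for both children
    return x + y;
}

int main(void) {
    long result;
    #pragma omp parallel            // create a team of threads
    {
        #pragma omp single          // one thread starts the top-level call
        result = pfib(20);
    }
    printf("Fibonacci(20) = %ld\n", result);   // prints 6765
    return 0;
}

A real implementation would stop creating tasks below some cutoff size, since spawning a task for every tiny subproblem costs more than it saves.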
Work of Fibonacci

We want to know the work and span of the Fibonacci computation, so that we can compute the parallelism (work/span) of the computation.

The work T1 is straightforward, since it amounts to computing the running time of the serialized algorithm:

    T1 = θ( ((1+sqrt(5))/2)^n )
Span of Fibonacci

Recall that the span T∞ is the longest path in the computation DAG. Since Fibonacci(n) spawns
• Fibonacci(n-1)
• Fibonacci(n-2)
we have

    T∞(n) = max( T∞(n-1), T∞(n-2) ) + θ(1) = T∞(n-1) + θ(1)

which yields T∞(n) = θ(n).
Parallelism of Fibonacci

The parallelism of the Fibonacci computation is

    T1(n)/T∞(n) = θ( ((1+sqrt(5))/2)^n / n )

which grows dramatically as n gets large. Therefore, even on the largest parallel computers, a modest value of n suffices to achieve near perfect linear speedup, since we have considerable parallel slackness.
Parallel Loops

Many algorithms contain loops all of whose iterations can operate in parallel. We can parallelize such loops using the spawn and sync keywords, but it is much more convenient to specify directly that the iterations of such loops can run concurrently. The pseudocode provides this functionality via the parallel concurrency keyword, which precedes the for keyword in a for loop statement.
Parallel Loops

The parallel for keywords in lines 3 and 5 of the MAT-VEC procedure indicate that the iterations of the respective loops may be run concurrently. A compiler can implement each parallel for loop as a divide-and-conquer subroutine using nested parallelism. For example, the parallel for loop in lines 5–7 can be implemented with the call MAT-VEC-MAIN-LOOP(A, x, y, n, 1, n).
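The MAT-VEC pseudocode itself appeared as a figure in the slides. As a rough equivalent (a sketch with assumed sizes and names, not the slides' own code), the outer loop of a matrix-vector product can be parallelized like this:

#include <stdio.h>
#include <omp.h>

#define N 4

// y = A * x, with the iterations of the outer loop running in parallel.
// Each iteration writes only y[i], so the iterations are independent.
void mat_vec(double A[N][N], double x[N], double y[N]) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
    }
}

int main(void) {
    double A[N][N] = {{1,0,0,0},{0,2,0,0},{0,0,3,0},{0,0,0,4}};
    double x[N] = {1, 1, 1, 1};
    double y[N];
    mat_vec(A, x, y);
    for (int i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);   // prints 1 2 3 4
    return 0;
}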
Race Conditions

A multithreaded algorithm is deterministic if and only if it does the same thing on the same input, no matter how the instructions are scheduled. A multithreaded algorithm is nondeterministic if its behavior might vary from run to run.

Often, a multithreaded algorithm that is intended to be deterministic fails to be.
Determinacy Race

A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

RACE-EXAMPLE()
    x = 0
    parallel for i = 1 to 2
        x = x + 1
    print x
Determinacy Race

When a processor increments x, the operation is not indivisible, but is composed of a sequence of instructions:
1) Read x from memory into one of the processor's registers.
2) Increment the value of the register.
3) Write the value in the register back into x in memory.
Determinacy Race

One possible interleaving of the two parallel increments:

    x = 0
    assign r1 = 0
    incr r1, so r1 = 1
    assign r2 = 0
    incr r2, so r2 = 1
    write back x = r1
    write back x = r2
    print x          // now prints 1 instead of 2
Determinacy Race

If the effect of the parallel execution were that processor 1 executed all its instructions before processor 2, the value 2 would be printed. Conversely, if the effect were that processor 2 executed all its instructions before processor 1, the value 2 would still be printed. When the instructions of the two processors execute at the same time, however, it is possible, as in this example execution, that one of the updates to x is lost.
Determinacy Race

Generally, most orderings produce correct results, but some orderings generate improper results when the instructions interleave. Consequently, races can be extremely hard to test for. You can run tests for days and never see the bug, only to experience a catastrophic system crash in the field when the outcome is critical.

Although we can cope with races in a variety of ways, including mutual exclusion locks and other methods of synchronization, for our purposes we shall simply ensure that strands that operate in parallel are independent: they have no determinacy races among them. Thus, in a parallel for construct, all the iterations should be independent. Between a spawn and the corresponding sync, the code of the spawned child should be independent of the code of the parent.
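As a concrete illustration (a sketch, not part of the original slides), the following OpenMP program contains exactly this determinacy race on a shared counter; making the update atomic removes the race:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int racy = 0, safe = 0;
    const int iterations = 100000;

    #pragma omp parallel for
    for (int i = 0; i < iterations; i++) {
        racy = racy + 1;        // determinacy race: the read-modify-write is not indivisible
        #pragma omp atomic
        safe = safe + 1;        // atomic update: no race
    }

    // "racy" is often smaller than 100000 because some updates are lost;
    // "safe" is always exactly 100000.
    printf("racy = %d, safe = %d\n", racy, safe);
    return 0;
}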
Matrix Multiplication

Matrix Multiplication: Naïve Method

void multiply(int A[][N], int B[][N], int C[][N])
{
    // Standard triple loop: C = A * B for N x N matrices.
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
            {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
Matrix Multiplication: Divide and Conquer

The following is a simple divide-and-conquer method to multiply two square matrices:
1) Divide matrices A and B into 4 sub-matrices of size N/2 x N/2 (writing A in block form with sub-matrices a, b, c, d and B with sub-matrices e, f, g, h).
2) Calculate the following values recursively: ae + bg, af + bh, ce + dg and cf + dh.

In the above method, we do 8 multiplications of matrices of size N/2 x N/2 and 4 additions. Addition of two matrices takes O(N^2) time, so the time complexity can be written as

    T(N) = 8T(N/2) + O(N^2)

From the Master theorem, the time complexity of the above method is O(N^3), which is unfortunately the same as the naive method.
Strassen's Matrix Multiplication Method

- In the divide-and-conquer method, the main contributor to the high time complexity is the 8 recursive calls.
- The idea of Strassen's method is to reduce the number of recursive calls to 7.
- Strassen's method is similar to the simple divide-and-conquer method in the sense that it also divides the matrices into sub-matrices of size N/2 x N/2, but in Strassen's method, the four sub-matrices of the result are calculated using the following formulae.
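The formulae themselves appeared as a figure in the original slides. With the block labels used above (A partitioned into a, b, c, d and B into e, f, g, h), the standard Strassen products and result blocks are:

    p1 = a(f - h)
    p2 = (a + b)h
    p3 = (c + d)e
    p4 = d(g - e)
    p5 = (a + d)(e + h)
    p6 = (b - d)(g + h)
    p7 = (a - c)(e + f)

    c11 = p5 + p4 - p2 + p6
    c12 = p1 + p2
    c21 = p3 + p4
    c22 = p1 + p5 - p3 - p7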
Strassen's Matrix Multiplication Method

Thus, to multiply two 2 x 2 matrices, Strassen's algorithm makes seven multiplications and 18 additions/subtractions, whereas the normal algorithm requires eight multiplications and four additions.
Matrix Multiplication

One can multiply n x n matrices serially in time θ(n^(log2 7)) = O(n^2.81) using Strassen's divide-and-conquer method. We will use multithreading for a simpler divide-and-conquer algorithm.
Simple Divide-and-Conquer

To multiply two n x n matrices, we perform 8 matrix multiplications of n/2 x n/2 matrices and one addition of n x n matrices.
Matrix Multiplication

Matrix-Multiply(C, A, B, n):
    // Multiplies matrices A and B, storing the result in C.
    // n is a power of 2 (for simplicity).
    if n == 1:
        C[1, 1] = A[1, 1] · B[1, 1]
    else:
        allocate a temporary matrix T[1...n, 1...n]
        partition A, B, C, and T into (n/2)x(n/2) submatrices
        spawn Matrix-Multiply(C11, A11, B11, n/2)
        spawn Matrix-Multiply(C12, A11, B12, n/2)
        spawn Matrix-Multiply(C21, A21, B11, n/2)
        spawn Matrix-Multiply(C22, A21, B12, n/2)
        spawn Matrix-Multiply(T11, A12, B21, n/2)
        spawn Matrix-Multiply(T12, A12, B22, n/2)
        spawn Matrix-Multiply(T21, A22, B21, n/2)
        Matrix-Multiply(T22, A22, B22, n/2)
        sync
        Matrix-Add(C, T, n)
Addition of Matrices

Matrix-Add(C, T, n):
    // Adds matrices C and T in-place, producing C = C + T.
    // n is a power of 2 (for simplicity).
    if n == 1:
        C[1, 1] = C[1, 1] + T[1, 1]
    else:
        partition C and T into (n/2)x(n/2) submatrices
        spawn Matrix-Add(C11, T11, n/2)
        spawn Matrix-Add(C12, T12, n/2)
        spawn Matrix-Add(C21, T21, n/2)
        spawn Matrix-Add(C22, T22, n/2)
        sync
Work of Matrix Multiplication

The work T1(n) of matrix multiplication satisfies the recurrence

    T1(n) = 8 T1(n/2) + θ(n^2) = θ(n^3)

by case 1 of the Master theorem.

Recall that the parallelism (maximum possible speedup) of a multithreaded computation is given by T1/T∞.
Span of Matrix Multiplication

The span T∞(n) of matrix multiplication is determined by:
- the span for partitioning, θ(1)
- the span of the parallel nested for loops at the end, θ(log n)
- the maximum span of the 8 matrix multiplications

    T∞(n) = T∞(n/2) + θ(log n)

Solving this recurrence, we get T∞(n) = θ((log n)^2).

The parallelism of matrix multiplication is given by

    T1(n) / T∞(n) = θ( n^3 / (log n)^2 )
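For a sense of scale (an illustrative calculation, not from the slides): for n = 1000, n^3 = 10^9 and (lg n)^2 ≈ 100, so the parallelism is on the order of 10^7, far more than the number of processors in any real machine.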


Merge Sort

Merge Sort – Serial Version
Multithreaded Merge Sort
The P-MERGE procedure assumes that the two subarrays to be merged lie within the same array. P-MERGE takes as an argument an output subarray A into which the merged values should be stored. The call P-MERGE(T, p1, r1, p2, r2, A, p3) merges the sorted subarrays T[p1..r1] and T[p2..r2] into the subarray A[p3..r3], where r3 = p3 + (r1 - p1 + 1) + (r2 - p2 + 1) - 1 = p3 + (r1 - p1) + (r2 - p2) + 1 and is not provided as an input.
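The P-MERGE and multithreaded merge sort pseudocode appeared as figures in the original slides. As a rough illustration of the overall structure (a sketch with assumed names that uses a plain serial merge rather than the parallel P-MERGE, so its span is worse than the slides' version), a task-parallel merge sort looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

// Serial merge of a[lo..mid] and a[mid+1..hi] using tmp as scratch space.
static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid + 1, k = lo;
    while (i <= mid && j <= hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= mid) tmp[k++] = a[i++];
    while (j <= hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo + 1) * sizeof(int));
}

// The two recursive sorts are "spawned" as tasks and joined with taskwait.
static void merge_sort(int *a, int *tmp, int lo, int hi) {
    if (lo >= hi) return;
    int mid = lo + (hi - lo) / 2;
    #pragma omp task shared(a, tmp)      // "spawn" the left half
    merge_sort(a, tmp, lo, mid);
    merge_sort(a, tmp, mid + 1, hi);     // the parent sorts the right half
    #pragma omp taskwait                 // "sync" before merging
    merge(a, tmp, lo, mid, hi);
}

int main(void) {
    int a[] = {5, 2, 9, 1, 7, 3, 8, 6, 4, 0};
    int n = (int)(sizeof a / sizeof a[0]);
    int *tmp = malloc((size_t)n * sizeof(int));
    #pragma omp parallel
    {
        #pragma omp single
        merge_sort(a, tmp, 0, n - 1);
    }
    for (int i = 0; i < n; i++) printf("%d ", a[i]);   // prints 0..9 in order
    printf("\n");
    free(tmp);
    return 0;
}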

Parallelism:
Acknowledgements

• https://www.youtube.com/watch?v=UaCX8Iy00DA
• https://www.youtube.com/watch?v=VD8hY7kWjdc
• https://www.youtube.com/watch?v=7T-gjX24FR0
• https://www.slideshare.net/AndresMendezVazquez/24-multithreaded-algorithms
• https://homes.luddy.indiana.edu/achauhan/Teaching/B403/LectureNotes/11-multithreaded.html
• https://catonmat.net/mit-introduction-to-algorithms-part-thirteen
• https://www.youtube.com/watch?v=iFrmLRr9ke0
• https://www.youtube.com/watch?v=GvtgV2NkdVg&t=31s
• https://www.youtube.com/watch?v=_XOZ2IiP2nw
• Analysis of merge sort: https://www.youtube.com/watch?v=0nlPxaC2lTw
