
COMP 4007: Parallel Processing and Computer Architecture

Tutorial 4:
Hybrid Parallel Programming Models
TA: Hucheng Liu ([email protected])
Contents
• Part 1: MPI + OpenMP

• Part 2: MPI + CUDA


Part 1: MPI + OpenMP
MPI + OpenMP: Motivation
• Two-level parallelization
• Mimics the hardware layout of a cluster
• MPI between nodes or CPU sockets
• OpenMP within shared-memory nodes or processors
• Pros
• No message passing inside the shared-memory processor (SMP) nodes
• No topology problem
• Cons
• Must be careful with sleeping threads
• Not always better than pure MPI or pure OpenMP
[Figure: each node runs one MPI process whose OpenMP threads T0–T3 share memory; MPI connects the processes across nodes.]
MPI Rules with OpenMP
• Special MPI initialization for multi-threaded MPI processes:
int MPI_Init_thread(int *argc, char ***argv,
                    int thread_level_required,
                    int *thread_level_provided);
int MPI_Query_thread(int *thread_level_provided);
int MPI_Is_thread_main(int *flag);
• thread_level_required specifies the requested level of thread support.
• The actual level of support is then returned in thread_level_provided.
Four Options for Thread Support
• MPI_THREAD_SINGLE
• Only one thread will execute; equivalent to MPI_Init
• MPI_THREAD_FUNNELED
• Only the master thread will make MPI calls
• MPI_THREAD_SERIALIZED
• Multiple threads may make MPI calls, but only one at a time
• MPI_THREAD_MULTIPLE
• Multiple threads may call MPI with no restrictions
• In most cases MPI_THREAD_FUNNELED is the best choice for hybrid programs (see the sketch below)
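A minimal sketch of requesting funneled thread support and checking what the library actually provides (the error handling and program layout are illustrative, not part of the tutorial code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided;

    /* Request MPI_THREAD_FUNNELED: only the master thread will call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    /* The library may grant less than requested, so check the result. */
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "Insufficient thread support: got level %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... hybrid MPI + OpenMP work goes here ... */

    MPI_Finalize();
    return 0;
}

With MPI_THREAD_FUNNELED, MPI calls made inside a parallel region must be confined to the master thread, which is exactly what the #pragma omp master blocks in the array-sum example below do.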
Hybrid Hello
• mpi_omp_hello.c
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Get_processor_name(processor_name, &namelen);

#pragma omp parallel default(shared) private(iam, np)
{
    np = omp_get_num_threads();
    iam = omp_get_thread_num();
    printf("Hybrid: Hello from thread %d out of %d from process %d out of %d on %s\n",
           iam, np, rank, numprocs, processor_name);
}
Hybrid Array Sum: Funneled MPI calls
• mpi_omp_SumArray.c: Process 0
#pragma omp parallel
{
    if (pid == 0) {
        /* Funneled: only the master thread makes the MPI calls. */
        #pragma omp master
        {
            for (int i = 1; i < np; i++) {
                MPI_Send(&elements_per_process, …);
                MPI_Send(&a[i * elements_per_process…);
            }
        }
        /* Wait until the master thread has distributed the data. */
        #pragma omp barrier
        #pragma omp for reduction(+:local_sum)
        for (int i = 0; i < elements_per_process; i++)
            local_sum += a[i];
    }
}
Hybrid Array Sum
• mpi_omp_SumArray.c: Other Processes
#pragma omp parallel
{
    /* The if (pid == 0) branch is shown on the previous slide. */
    else {
        #pragma omp master
        {
            MPI_Recv(&n_elements_recieved, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Recv(a2, n_elements_recieved, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }
        /* Wait until the master thread has received the data. */
        #pragma omp barrier
        #pragma omp for reduction(+:local_sum)
        for (int i = 0; i < n_elements_recieved; i++)
            local_sum += a2[i];
    }
}
Hybrid Array Sum
• mpi_omp_SumArray.c: All Processes
MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
Environment Setup
• Set up passwordless SSH login between nodes
• Refer to the Lab 3 slides
• Check that OpenMPI and OpenMP are available; install them if not
Compilation
• Use the OpenMPI compiler wrapper with the OpenMP -fopenmp switch
• mpic++ -fopenmp -o mpi_omp_hello mpi_omp_hello.c
• mpic++ -fopenmp -o mpi_omp_SumArray mpi_omp_SumArray.c
Execution
• Nearly the same as with pure MPI
• With the default number of threads in OpenMP sections
• mpiexec -hostfile hostfile ./mpi_omp_hello
• Specify OMP_NUM_THREADS
• mpiexec -hostfile hostfile -x OMP_NUM_THREADS=3 ./mpi_omp_hello
• -x: export an environment variable to the remote nodes before executing the program, optionally specifying a value
• Specify OMP_NUM_THREADS per host
• mpiexec -n 1 --host csl2wk01 -x OMP_NUM_THREADS=3 ./mpi_omp_hello : -n 2 --host csl2wk02:2 -x OMP_NUM_THREADS=2 ./mpi_omp_hello
Practice
• Implement vector addition using MPI and OpenMP (a minimal sketch follows below)
• Sample code: ./practice/mpi_openmp/vector_addition.c
• Solution: ./practice/mpi_openmp/vector_addition_solution.c
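For reference, one possible way to structure the exercise is sketched below; this is an illustration under assumed names and sizes, not the contents of vector_addition_solution.c:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000               /* total vector length (assumed) */

int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;     /* assume N is divisible by nprocs */
    double *a = NULL, *b = NULL, *c = NULL;
    double *la = malloc(chunk * sizeof(double));
    double *lb = malloc(chunk * sizeof(double));
    double *lc = malloc(chunk * sizeof(double));

    if (rank == 0) {            /* the root initializes the full vectors */
        a = malloc(N * sizeof(double));
        b = malloc(N * sizeof(double));
        c = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
    }

    /* Distribute equal chunks of a and b to every process. */
    MPI_Scatter(a, chunk, MPI_DOUBLE, la, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, chunk, MPI_DOUBLE, lb, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process adds its chunk with OpenMP threads. */
    #pragma omp parallel for
    for (int i = 0; i < chunk; i++)
        lc[i] = la[i] + lb[i];

    /* Collect the partial results back on the root. */
    MPI_Gather(lc, chunk, MPI_DOUBLE, c, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("c[N-1] = %f\n", c[N - 1]);

    MPI_Finalize();
    return 0;
}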
Part 2: MPI + CUDA
Hybrid CUDA and MPI: Motivation
• MPI makes it easy to exchange data located on different processors
• CPU <-> CPU: traditional MPI
• GPU <-> GPU: CUDA-aware MPI
• MPI + CUDA makes the application run more efficiently
• All operations required to carry out the message transfer can be pipelined
• Acceleration technologies like GPUDirect can be utilized by the MPI library transparently to the user
Unified Virtual Addressing (UVA)
• No UVA: separate address spaces for CPU and GPU memory
• UVA: one address space for all CPU and GPU memory
• The physical memory location can be determined from a pointer value (see the sketch below)
• Enables libraries to simplify their interfaces (e.g., MPI and cudaMemcpy)
• Supported on devices with compute capability 2.0 and higher
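As a rough illustration of determining the memory location from a pointer, the CUDA runtime can classify any pointer under UVA; a minimal sketch (error handling kept to a minimum, output strings are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Print whether a pointer refers to device or host memory. */
void where_is(const void *p) {
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, p) == cudaSuccess &&
        attr.type == cudaMemoryTypeDevice)
        printf("%p is device memory (device %d)\n", p, attr.device);
    else
        printf("%p is host (or unregistered) memory\n", p);
}

int main(void) {
    float *h = (float *)malloc(4 * sizeof(float));
    float *d = NULL;
    cudaMalloc((void **)&d, 4 * sizeof(float));

    where_is(h);   /* expected: host */
    where_is(d);   /* expected: device */

    cudaFree(d);
    free(h);
    return 0;
}

This is essentially what a CUDA-aware MPI library does internally: given a buffer pointer, it decides whether to stage the transfer through the host or move the data directly from GPU memory.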
UVA Data Exchange with MPI
Example: Matrix Multiplication
• The root process generates two random matrices of the input size and stores them in 1-D arrays in row-major order.
• The first matrix (Matrix A) is divided into columns according to the number of processes, and each part is sent to a separate GPU. (MPI_Scatter)
• The second matrix (Matrix B) is broadcast to all nodes and copied to every GPU to perform the computation. (MPI_Bcast)
• Each GPU computes its own part of the result matrix and sends the result back to the root process.
• The results are gathered into the resultant matrix. (MPI_Gather)
Code
• Without UVA: send the data from host memory.
• matvec.cu
• With UVA: send the data from device memory.
• matvec_uva.cu
matvec.cu (Without UVA)
• 1. Generate the data in the master process:
• Status = IntializingMatrixVectors(&MatrixA, &MatrixB, &ResultVector, RowsNo, ColsNo, RowsNo2, ColsNo2);
• 2. Send data to the different processes in host memory:
• MPI_Bcast(MatrixB, matrixBsize, MPI_FLOAT, 0, MPI_COMM_WORLD);
• MPI_Scatter(MatrixA, ScatterSize * ColsNo, MPI_FLOAT, MyMatrixA, ScatterSize * ColsNo, MPI_FLOAT, 0, MPI_COMM_WORLD);
matvec.cu (Without UVA)
• 3. Allocate memory on the device in each process:
• cudaMalloc((void **)&DeviceMyMatrixA, ScatterSize * ColsNo * sizeof(float));
• cudaMalloc((void **)&DeviceMatrixB, matrixBsize * sizeof(float));
• cudaMalloc((void **)&DeviceMyResultVector, elements * sizeof(float));
• 4. Copy the data from host to device in each process:
• cudaMemcpy((void *)DeviceMyMatrixA, (void *)MyMatrixA, ScatterSize * ColsNo * sizeof(float), cudaMemcpyHostToDevice);
• cudaMemcpy((void *)DeviceMatrixB, (void *)MatrixB, matrixBsize * sizeof(float), cudaMemcpyHostToDevice);
• 5. Do the calculation in each process:
• MatrixVectorMultiplication<<<1, 256>>>(DeviceMyMatrixA, DeviceMatrixB, DeviceMyResultVector, RowsNo, ColsNo, RowsNo2, ColsNo2, ColsNo, ScatterSize, BLOCKSIZE, MyRank, NumberOfProcessors);
matvec.cu (Without UVA)
• 6. Copy the result from device to host in each process:
• cudaMemcpy((void *)MyResultMatrix, (void *)DeviceMyResultVector, elements * sizeof(float), cudaMemcpyDeviceToHost);
• 7. Gather the results:
• MPI_Gather(MyResultMatrix, elements, MPI_FLOAT, ResultVector, elements, MPI_FLOAT, 0, MPI_COMM_WORLD);
matvec_uva.cu (With UVA)
• 1. Generate the data in the master process:
• Status = IntializingMatrixVectors(&MatrixA, &MatrixB, &ResultVector, RowsNo, ColsNo, RowsNo2, ColsNo2);
• 2. Allocate memory on the device in the master process:
• cudaMalloc((void **)&DeviceRootMatrixA, RowsNo * ColsNo * sizeof(float));
• cudaMalloc((void **)&DeviceRootResultVector, RowsNo * ColsNo2 * sizeof(float));
• 3. Copy the data from host to device in the master process:
• cudaMemcpy((void *)DeviceRootMatrixA, (void *)MatrixA, RowsNo * ColsNo * sizeof(float), cudaMemcpyHostToDevice);
matvec_uva.cu (With UVA)
• 4. Allocate memory on the device in each process:
• cudaMalloc((void **)&DeviceMyMatrixA, ScatterSize * ColsNo * sizeof(float));
• cudaMalloc((void **)&DeviceMatrixB, matrixBsize * sizeof(float));
• cudaMalloc((void **)&DeviceMyResultVector, elements * sizeof(float));
• 5. Send data to the different processes directly in device memory:
• MPI_Bcast(DeviceMatrixB, matrixBsize, MPI_FLOAT, 0, MPI_COMM_WORLD);
• MPI_Scatter(DeviceRootMatrixA, ScatterSize * ColsNo, MPI_FLOAT, DeviceMyMatrixA, ScatterSize * ColsNo, MPI_FLOAT, 0, MPI_COMM_WORLD);
matvec_uva.cu (With UVA)
• 6. Do the calculation in each process:
• MatrixVectorMultiplication<<<1, 256>>>(DeviceMyMatrixA, DeviceMatrixB, DeviceMyResultVector, RowsNo, ColsNo, RowsNo2, ColsNo2, ColsNo, ScatterSize, BLOCKSIZE, MyRank, NumberOfProcessors);
• 7. Gather the results in device memory in the master process:
• MPI_Gather(DeviceMyResultVector, elements, MPI_FLOAT, DeviceRootResultVector, elements, MPI_FLOAT, 0, MPI_COMM_WORLD);
• 8. Copy the result from device to host in the master process:
• cudaMemcpy((void *)ResultVector, (void *)DeviceRootResultVector, RowsNo * ColsNo2 * sizeof(float), cudaMemcpyDeviceToHost);
Environment Setup
• CUDA 11 and OpenMPI 3.0
• setenv PATH "${PATH}:/usr/local/cuda-11/bin/" (tcsh syntax; a bash equivalent is shown below)
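If your shell is bash rather than tcsh, the equivalent command (assuming the same CUDA install path) is:
• export PATH="${PATH}:/usr/local/cuda-11/bin/"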
Compilation
• 1. Put both the MPI and CUDA code in a single file, matvec.cu.
• This program can be compiled with nvcc, which internally uses gcc/g++ to compile the C/C++ code, and linked against the MPI library:
• /usr/local/cuda/bin/nvcc -Xcompiler -g -w -I.. -I /usr/local/software/openmpi/include/ -L /usr/local/software/openmpi/lib -lmpi matvec.cu -o newfloatmatvec
Compilation
• 2. Have MPI and CUDA code separate in two files: main.c and
multiply.cu respectively. These two files can be compiled using mpicc,
and nvcc respectively into object files (.o) and combined into a single
executable file using mpicc.
• 3. This third option is an opposite compilation of the first one,
using mpicc, meaning that you have to link to your CUDA library.
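A possible command sequence for option 2; the library path and output name (myprog) are assumptions and must be adapted to your environment:
• mpicc -c main.c -o main.o
• nvcc -c multiply.cu -o multiply.o
• mpicc main.o multiply.o -L/usr/local/cuda/lib64 -lcudart -o myprog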
Execution
• Use mpiexec. If the program was compiled with nvcc, include the OpenMPI lib path in LD_LIBRARY_PATH (if OpenMPI is not installed in a default path).
• mpiexec --host csl2wk26:1,csl2wk25:1 -x LD_LIBRARY_PATH=/usr/local/software/openmpi/lib:$LD_LIBRARY_PATH ./newfloatmatvec 4 3 3 4 -p -v
Practice
• Implement vector addition using MPI and CUDA (a minimal non-UVA sketch follows below)
• Without UVA
• Sample code: ./practice/mpi_cuda/vector_addition.cu
• Solution: ./practice/mpi_cuda/vector_addition_solution.cu
• With UVA
• Sample code: ./practice/mpi_cuda/vector_addition_uva.cu
• Solution: ./practice/mpi_cuda/vector_addition_uva_solution.cu
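A minimal sketch of the non-UVA variant, following the same scatter / stage-to-GPU / gather pattern as matvec.cu; the vector length, kernel name, and variable names are assumptions, not the contents of the solution file:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Each thread adds one element of its process's chunk. */
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(int argc, char *argv[]) {
    const int N = 1 << 20;                 /* total length (assumed) */
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;                /* assume N divisible by nprocs */
    float *A = NULL, *B = NULL, *C = NULL;
    float *ha = (float *)malloc(chunk * sizeof(float));
    float *hb = (float *)malloc(chunk * sizeof(float));
    float *hc = (float *)malloc(chunk * sizeof(float));

    if (rank == 0) {                       /* the root generates the full vectors */
        A = (float *)malloc(N * sizeof(float));
        B = (float *)malloc(N * sizeof(float));
        C = (float *)malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2.0f * i; }
    }

    /* Without UVA: scatter host buffers, then stage them onto the GPU. */
    MPI_Scatter(A, chunk, MPI_FLOAT, ha, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(B, chunk, MPI_FLOAT, hb, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    float *da, *db, *dc;
    cudaMalloc((void **)&da, chunk * sizeof(float));
    cudaMalloc((void **)&db, chunk * sizeof(float));
    cudaMalloc((void **)&dc, chunk * sizeof(float));
    cudaMemcpy(da, ha, chunk * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, chunk * sizeof(float), cudaMemcpyHostToDevice);

    vec_add<<<(chunk + 255) / 256, 256>>>(da, db, dc, chunk);
    cudaMemcpy(hc, dc, chunk * sizeof(float), cudaMemcpyDeviceToHost);

    /* Gather the partial results back on the root from host memory. */
    MPI_Gather(hc, chunk, MPI_FLOAT, C, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("C[N-1] = %f\n", C[N - 1]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    MPI_Finalize();
    return 0;
}

For the UVA variant, the device pointers would be passed to MPI_Scatter and MPI_Gather directly, as in matvec_uva.cu above.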
Reference commands: run_lab4.sh
• Check the available thread support: ompi_info | grep -i thread
• https://www.open-mpi.org/faq/?category=runcuda
• Check for CUDA-aware support: ompi_info --parsable --all | grep mpi_built_with_cuda_support:value (a runtime check from inside a program is sketched below)
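OpenMPI also exposes this check at compile time and run time through its mpi-ext.h extension header; a minimal sketch (OpenMPI-specific, so guard it as shown):

#include <mpi.h>
#include <stdio.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>   /* OpenMPI extension header */
#endif

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    /* Support was compiled in; ask the runtime as well. */
    printf("CUDA-aware support at run time: %s\n",
           MPIX_Query_cuda_support() ? "yes" : "no");
#else
    printf("This OpenMPI build has no compile-time CUDA-aware support.\n");
#endif
    MPI_Finalize();
    return 0;
}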
