02 CUDA Shared Memory

The document provides an overview of CUDA shared memory, highlighting the differences between host (CPU) and device (GPU) programming, memory management functions, and the implementation of a 1D stencil operation using shared memory. It discusses the importance of synchronizing threads within a block to avoid data hazards and introduces cooperative groups for efficient thread communication. Additionally, it includes resources for further study and homework assignments for practical application.


CUDA SHARED MEMORY

NVIDIA Corporation
REVIEW (1 OF 2)

Difference between host and device

Host: CPU

Device: GPU

Using __global__ to declare a function as device code

Executes on the device

Called from the host (or possibly from other device code)

Passing parameters from host code to a device function
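For reference, a minimal sketch of these points (the kernel name add and its body are illustrative here, not taken verbatim from the deck):

    // __global__ marks device code: it executes on the GPU and is launched from host code
    __global__ void add(int *a, int *b, int *c) {
        // a, b, c are device pointers passed as parameters from the host at launch time
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }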

2
REVIEW (2 OF 2)

Basic device memory management

cudaMalloc()

cudaMemcpy()
cudaFree()

Launching parallel kernels

Launch N copies of add() with add<<<N,1>>>(…);


Use blockIdx.x to access block index
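A sketch of the host-side flow these bullets describe, using the add() kernel above; the host arrays a, b, c are assumed to be allocated and initialized, N is the element count, and error checking is omitted:

    int *d_a, *d_b, *d_c;                 // device copies of a, b, c
    size_t size = N * sizeof(int);

    // Allocate device memory
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // Copy inputs to the device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch N copies of add(); each block uses blockIdx.x to pick its element
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Copy the result back and release device memory
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);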

3
1D STENCIL

Consider applying a 1D stencil to a 1D array of elements

Each output element is the sum of input elements within a radius

If radius is 3, then each output element is the sum of 7 input elements:

(figure: stencil window with RADIUS input elements on either side of the center element)
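As a point of reference, the same stencil written as a serial loop (a sketch; the names in, out, n, and RADIUS, and the skipping of the array ends, are assumptions):

    // Serial reference: out[i] is the sum of in[i - RADIUS] .. in[i + RADIUS]
    for (int i = RADIUS; i < n - RADIUS; i++) {
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += in[i + offset];
        out[i] = result;
    }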

4
IMPLEMENTING WITHIN A BLOCK

Each thread processes one output element

blockDim.x elements per block

Input elements are read several times

With radius 3, each input element is read seven times

5
SHARING DATA BETWEEN THREADS

Terminology: within a block, threads share data via shared memory

Extremely fast on-chip memory, user-managed

Declare using __shared__, allocated per block

Data is not visible to threads in other blocks

6
IMPLEMENTING WITH SHARED MEMORY

Cache data in shared memory

Read (blockDim.x + 2 * radius) input elements from global memory to shared memory

Compute blockDim.x output elements


Write blockDim.x output elements to global memory

Each block needs a halo of radius elements at each boundary

(figure: a block's shared array, with a halo of RADIUS elements on the left and right of its blockDim.x output elements)

7


STENCIL KERNEL
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

8
STENCIL KERNEL
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

9
DATA RACE!
The stencil example will not work as written…

Suppose thread 15 reads the halo before thread 0 has fetched it (with RADIUS = 3 and BLOCK_SIZE = 16):

    temp[lindex] = in[gindex];                           // thread 15 stores at temp[18]
    if (threadIdx.x < RADIUS) {                          // skipped: threadIdx.x >= RADIUS for thread 15
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    int result = 0;
    result += temp[lindex + 1];                          // loads temp[19], which thread 0 may not have written yet

10
__SYNCTHREADS()

void __syncthreads();
Synchronizes all threads within a block

Used to prevent RAW / WAR / WAW hazards

All threads must reach the barrier

In conditional code, the condition must be uniform across the block
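A brief sketch of what "uniform" means here (illustrative code, not from the deck): the barrier is safe when every thread in a block evaluates the condition the same way, and invalid when only some threads in the block reach it.

    // OK: the condition depends only on per-block values, so within any given
    // block either all threads call __syncthreads() or none of them do
    if (blockIdx.x == 0) {
        __syncthreads();
    }

    // NOT OK: threads within the same block diverge, so some threads never
    // reach the barrier and the behavior is undefined
    if (threadIdx.x < RADIUS) {
        __syncthreads();    // other threads in the block never arrive here
    }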

11
STENCIL KERNEL
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();
12
STENCIL KERNEL

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
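One way the kernel might be launched (a sketch, not the deck's exercise code): it assumes N is a multiple of BLOCK_SIZE and that the device arrays are padded by RADIUS elements at each end, so the halo reads in[gindex - RADIUS] and in[gindex + BLOCK_SIZE] stay in bounds.

    // d_in and d_out point RADIUS elements past the start of padded device allocations
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);
    cudaDeviceSynchronize();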

13
REVIEW

Use __shared__ to declare a variable/array in shared memory

Data is shared between threads in a block

Not visible to threads in other blocks

Use __syncthreads() as a barrier


Use to prevent data hazards

14
LOOKING FORWARD

Cooperative Groups: a flexible model for synchronization and communication within groups of threads.

At a glance:

Scalable cooperation among groups of threads

Flexible parallel decompositions

Composition across software boundaries

Deploy everywhere

Benefits all applications. Examples include: persistent RNNs, physics, search algorithms, sorting.

15
FOR EXAMPLE: THREAD BLOCK
Implicit group of all the threads in the launched thread block

Implements the same interface as thread_group:

void sync(); // Synchronize the threads in the group

unsigned size(); // Total number of threads in the group

unsigned thread_rank(); // Rank of the calling thread within [0, size)

bool is_valid(); // Whether the group violated any API constraints

And additional thread_block specific functions:

dim3 group_index(); // 3-dimensional block index within the grid

dim3 thread_index(); // 3-dimensional thread index within the block
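A brief sketch of how this interface is used inside a kernel (the kernel name and body are illustrative):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void write_ranks(int *out) {
        // Implicit group of all the threads in the launched thread block
        cg::thread_block block = cg::this_thread_block();

        // thread_rank() is this thread's rank within the block, in [0, size())
        unsigned rank = block.thread_rank();

        // group_index() mirrors the built-in blockIdx; thread_index() mirrors threadIdx
        out[block.group_index().x * block.size() + rank] = rank;

        // Barrier over the group; equivalent to __syncthreads() for a thread_block
        block.sync();
    }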


16
NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache

Directed testing: shared memory vs. global memory through the L1 cache

(chart: on average the cache-only version reaches roughly 70% of shared memory performance on Pascal and roughly 93% on Volta)

Cache vs. shared:

• Easier to use

• 90%+ as good

Shared vs. cache:

• Faster atomics

• More banks

• More predictable
17
FUTURE SESSIONS

CUDA GPU architecture and basic optimizations

Atomics, Reductions, Warp Shuffle

Using Managed Memory

Concurrency (streams, copy/compute overlap, multi-GPU)

Analysis Driven Optimization

Cooperative Groups

18
FURTHER STUDY

Shared memory:

https://devblogs.nvidia.com/using-shared-memory-cuda-cc/

CUDA Programming Guide:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory

CUDA Documentation:

https://docs.nvidia.com/cuda/index.html

https://docs.nvidia.com/cuda/cuda-runtime-api/index.html (runtime API)

19
HOMEWORK

Log into Summit (ssh [email protected] -> ssh summit)

Clone GitHub repository:

git clone [email protected]:olcf/cuda-training-series.git

Follow the instructions in the readme.md file:

https://github.com/olcf/cuda-training-series/blob/master/exercises/hw2/readme.md

Prerequisites: basic Linux skills (e.g. ls, cd), knowledge of a text editor like vi or emacs, and some knowledge of C/C++ programming

20
QUESTIONS?
