02 CUDA Shared Memory
NVIDIA Corporation
REVIEW (1 OF 2)
Host CPU
Device GPU
Kernels (__global__ functions) run on the device and are called from the host (or possibly from other device code)
REVIEW (2 OF 2)
cudaMalloc()
cudaMemcpy()
cudaFree()
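As a quick refresher, the three calls fit together as in this minimal sketch (the `square` kernel, sizes, and launch configuration are illustrative, not from the deck; error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: square each element in place
__global__ void square(int *d, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) d[i] *= d[i];
}

int main(void) {
    const int N = 256;
    int h[N];
    for (int i = 0; i < N; i++) h[i] = i;

    int *d;
    cudaMalloc(&d, N * sizeof(int));                            // allocate device memory
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);  // copy host -> device
    square<<<1, N>>>(d, N);                                     // launch kernel
    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);  // copy device -> host
    cudaFree(d);                                                // release device memory
    printf("h[3] = %d\n", h[3]);
    return 0;
}
```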
1D STENCIL
[Figure: a 1D stencil; each output element is the sum of the input elements within `radius` positions on either side]
IMPLEMENTING WITHIN A BLOCK
SHARING DATA BETWEEN THREADS
IMPLEMENTING WITH SHARED MEMORY
• Cache data in shared memory: read (blockDim.x + 2 * radius) input elements from global memory to shared memory
• Compute blockDim.x output elements
• Write the results back to global memory
STENCIL KERNEL
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];
DATA RACE!
The stencil example will not work as written: a thread may read halo elements of temp[] before the neighboring threads responsible for populating them have finished writing
__SYNCTHREADS()
void __syncthreads();
Synchronizes all threads within a block: every thread must reach the barrier before any thread can proceed, preventing data hazards when threads exchange data through shared memory
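As a minimal illustration (a hypothetical kernel name and a fixed 64-thread block, not code from the deck), reversing an array within one block needs a barrier between the write phase and the read phase:

```cuda
// Hypothetical example: reverse a 64-element array with one 64-thread block.
__global__ void reverse_in_block(int *d) {
    __shared__ int s[64];
    int t = threadIdx.x;
    s[t] = d[t];       // phase 1: every thread writes one element
    __syncthreads();   // barrier: no thread proceeds until all writes land
    d[t] = s[63 - t];  // phase 2: each thread reads another thread's element
}
```

Without the barrier, a thread could read s[63 - t] before the thread that owns that element has written it.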
STENCIL KERNEL
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
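Completed with the halo loads, the barrier, and the store, the kernel might look like the sketch below (assuming, as in the exercise, that `in` points RADIUS elements past the start of the allocated input so the halo reads stay in bounds):

```cuda
#define RADIUS 3
#define BLOCK_SIZE 256

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        // The first RADIUS threads also fill the halo cells at both ends
        temp[lindex - RADIUS]     = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Make sure all writes to temp[] are visible before any reads
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
```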
REVIEW
LOOKING FORWARD
DEVELOPERS
At a glance:
• Benefits all applications
• Deploy everywhere
FOR EXAMPLE: THREAD BLOCK
Implicit group of all the threads in the launched thread block
Shared memory vs. cache:
• Faster atomics
• More banks
• More predictable
[Chart: comparison across the Pascal and Volta architectures]
FUTURE SESSIONS
Cooperative Groups
FURTHER STUDY
Shared memory:
https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
CUDA Documentation:
https://docs.nvidia.com/cuda/index.html
HOMEWORK
https://github.com/olcf/cuda-training-series/blob/master/exercises/hw2/readme.md
Prerequisites: basic Linux skills (e.g. ls, cd), familiarity with a text editor such as vi or emacs, and some knowledge of C/C++ programming
QUESTIONS?