Lecture 3: Fundamentals of CUDA (Part 1), 2025
Agenda
History
3D Graphics Pipeline
[Figure: the fixed-function 3D graphics pipeline. Vertices from vertex buffer objects flow through the API, primitive processing, transform and lighting, and primitive assembly into triangles/lines/points; the rasterizer converts these into fragments, which pass through texture environment, color sum, fog, alpha test, depth/stencil test, color buffer blend, and dither into the frame buffer. Successive builds of the figure mark where data is vertex shaded, rasterized, and pixel shaded.]
The Graphics Pipeline – 1st Gen.
▪ One chip/board per stage
▪ Fixed data flow through pipeline
[Figure: the same fixed-function pipeline, with each stage implemented by a separate chip or board.]
The Graphics Pipeline – 2nd Gen.
▪ Everything fixed function, with a certain number of modes
▪ Number of modes for each stage grew over time
▪ Hard to optimize HW
▪ Developers always wanted more flexibility
The Graphics Pipeline – 3rd Gen.
▪ Vertex & pixel processing became programmable
▪ GPU architecture increasingly centers around 'shader' execution
[Figure: the programmable pipeline. Vertices flow from the API through primitive processing, the vertex shader, primitive assembly, and the rasterizer into the pixel shader, then through depth/stencil test, color buffer blend, and dither into the frame buffer.]
Before CUDA
▪ Drawbacks:
o Tough learning curve
o Potentially high overhead of the graphics API
o Highly constrained memory layout & access model
Before CUDA
▪ What's wrong with the old GPGPU programming model?
What is CUDA?
▪ CUDA: Compute Unified Device Architecture
o Parallel computing platform and application programming interface (API) model
o A powerful parallel programming model for issuing and managing computations
on the GPU without mapping them to a graphics API
▪ Targeted software stack
o Library, Runtime, Driver
▪ Advantages
o SW: program the GPU in C
• Scalable data parallel execution/memory model
• C with minimal yet powerful extensions
o HW: fully general data-parallel architecture
▪ Features
o Heterogeneous - mixed serial-parallel programming
o Scalable - hierarchical thread execution model
o Accessible - minimal but expressive changes to C
What is CUDA?
[Figure: pixel shader (program).]
Review: Heterogeneous Computing
▪ Use more than one kind of processor or core
o CPUs for sequential parts
o GPUs for parallel parts
[Figure: a CPU with its main memory alongside a GPU with its video memory.]
Simple CUDA Model
▪ Host: CPU + main memory (host memory)
▪ Device: GPU + video memory (device memory)
[Figure: Host = CPU + main memory; Device = GPU + video memory.]
Simple CUDA Model
▪ GNU gcc: Linux C compiler
▪ nvcc: NVIDIA CUDA compiler
CUDA Program Execution Scenario
▪ Integrated "host + device" application C program
o Serial or modestly parallel parts in host code
o Highly parallel parts in device code
[Figure: execution alternates between serial host code and highly parallel device code.]
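A minimal sketch of this structure (the kernel name myKernel, the sizes, and the launch parameters here are hypothetical; kernel syntax is covered later):
// device code: runs in parallel on the GPU, one thread per element
__global__ void myKernel(int* data) {
    data[threadIdx.x] *= 2;
}
int main(void) {
    // host code: serial part, runs on the CPU
    int* dev_data = 0;
    cudaMalloc( (void**)&dev_data, 32 * sizeof(int) );
    cudaMemset( dev_data, 0, 32 * sizeof(int) );
    myKernel<<<1, 32>>>( dev_data ); // highly parallel part, runs on the GPU
    cudaDeviceSynchronize();         // host waits for the device to finish
    cudaFree( dev_data );
    return 0;
}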
CUDA program uses CUDA memory
▪ GPU cores share the "global memory" (device memory)
o DRAM (e.g., GDDR, HBM) is used as global memory
▪ To execute a kernel on a device:
o allocate global memory on the device
o transfer data from the host memory to the allocated device memory
o transfer result data from the device memory back to the host memory
o release the global memory
[Figure: the device's SMs (each with its own SMEM) connect to global memory; the host CPU has its own memory. Red lines mark global memory.]
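A minimal sketch of these four steps (illustrative names and sizes; cudaMemcpy and its direction flags are introduced on the following slides):
int host_src[1024], host_dst[1024];                               // host arrays (input assumed filled)
int* dev_buf = 0;
int nbytes = 1024 * sizeof(int);
cudaMalloc( (void**)&dev_buf, nbytes );                           // 1. allocate global memory on the device
cudaMemcpy( dev_buf, host_src, nbytes, cudaMemcpyHostToDevice );  // 2. host memory -> device memory
// ... launch kernel(s) that read/write dev_buf (covered later) ...
cudaMemcpy( host_dst, dev_buf, nbytes, cudaMemcpyDeviceToHost );  // 3. device memory -> host memory
cudaFree( dev_buf );                                              // 4. release global memory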
Memory Spaces (before UVM!!)
▪ Host memory and device memory are separate address spaces; data must be copied between them explicitly.
CPU Memory Allocation / Release
▪ Host (CPU) manages host (CPU) memory:
o void* malloc (size_t nbytes)
o void* memset (void* pointer, int value, size_t count)
o void free (void* pointer)
int n = 1024;
int nbytes = n * sizeof(int);
int* ptr = 0;
ptr = (int*)malloc( nbytes ); // cast required when compiling as C++
memset( ptr, 0, nbytes );
free( ptr );
GPU Memory Allocation / Release
▪ Host (CPU) manages device (GPU) memory:
o cudaMalloc (void** pointer, size_t nbytes)
o cudaMemset (void* pointer, int value, size_t count)
o cudaFree (void* pointer)
int n = 1024;
int nbytes = n * sizeof(int);
int* dev_a = 0;
cudaMalloc( (void**)&dev_a, nbytes );
cudaMemset( dev_a, 0, nbytes );
cudaFree( dev_a );
CUDA function rules
▪ Every CUDA runtime library function name starts with "cuda"
▪ Example:
o if (cudaMalloc(&devPtr, SIZE) != cudaSuccess) {
exit(1);
}
CUDA Malloc
▪ cudaError_t cudaMalloc( void** devPtr, size_t nbytes );
o allocates nbytes bytes of linear memory on the device
o The start address is stored into "devPtr"
o The memory is not cleared.
o returns cudaSuccess or cudaErrorMemoryAllocation
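A short sketch of the calling convention (dev_p and the size are illustrative): the address of the pointer is passed so cudaMalloc can store the allocated device address into it.
int* dev_p = 0;
// &dev_p is passed so cudaMalloc can write the device address into dev_p
if (cudaMalloc( (void**)&dev_p, 256 * sizeof(int) ) != cudaSuccess) {
    exit(1); // allocation failed: cudaErrorMemoryAllocation
}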
CUDA Memset
▪ cudaError_t cudaMemset( void* devPtr, int value, size_t nbytes );
o sets each of the first nbytes bytes of the memory area pointed to by devPtr to value
o returns cudaSuccess, cudaErrorInvalidValue, or cudaErrorInvalidDevicePointer
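Note that value is written byte by byte. A small sketch (illustrative sizes) of a common pitfall:
int* dev_a = 0;
cudaMalloc( (void**)&dev_a, 4 * sizeof(int) );
cudaMemset( dev_a, 0, 4 * sizeof(int) ); // every int becomes 0, as expected
cudaMemset( dev_a, 1, 4 * sizeof(int) ); // every byte becomes 0x01: each int is 0x01010101 = 16843009, not 1
cudaFree( dev_a );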
Data Copy
▪ cudaError_t cudaMemcpy( void* dst,
const void* src,
size_t nbytes,
enum cudaMemcpyKind direction );
o returns after the copy is complete
o blocks the CPU thread until all bytes have been copied
o doesn't start copying until previous CUDA calls have completed
▪ enum cudaMemcpyKind
o cudaMemcpyHostToDevice
o cudaMemcpyDeviceToHost
o cudaMemcpyDeviceToDevice
o cudaMemcpyHostToHost
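A short sketch of the two common directions (dev_a, a, b, and nbytes as in the example on the following slides):
cudaMemcpy( dev_a, a, nbytes, cudaMemcpyHostToDevice ); // host   -> device
cudaMemcpy( b, dev_a, nbytes, cudaMemcpyDeviceToHost ); // device -> host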
Data Copy
▪ host → host : memcpy (in C/C++)
▪ host → device, device → device, device → host : cudaMemcpy (CUDA)
Example: Host-Device Mem copy
▪ step 1.
o make a block of data
o print out the source data
▪ step 2.
o copy from host memory to device memory
o copy from device memory to device memory
o copy from device memory to host memory
▪ step 3.
o print out the result
[Figure: host code drives the copies; data moves host memory → device memory → device memory → host memory.]
Code: cuda memcpy (1/4)
#include <cstdio>
int main(void) {
// host-side data
const int SIZE = 5;
const int a[SIZE] = { 1, 2, 3, 4, 5 }; // source data
int b[SIZE] = { 0, 0, 0, 0, 0 }; // final destination
// print source
printf("a = {%d,%d,%d,%d,%d}\n", a[0], a[1], a[2], a[3], a[4]);
Code: cuda memcpy (2/4)
// device-side data
int* dev_a = 0;
int* dev_b = 0;
Code: cuda memcpy (3/4)
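A sketch of this step, consistent with the error-checked version of the same program shown later:
// allocate device memory
cudaMalloc( (void**)&dev_a, SIZE * sizeof(int) );
cudaMalloc( (void**)&dev_b, SIZE * sizeof(int) );
// copy from host to device
cudaMemcpy( dev_a, a, SIZE * sizeof(int), cudaMemcpyHostToDevice );
// copy from device to device
cudaMemcpy( dev_b, dev_a, SIZE * sizeof(int), cudaMemcpyDeviceToDevice );
// copy from device to host
cudaMemcpy( b, dev_b, SIZE * sizeof(int), cudaMemcpyDeviceToHost );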
Code: cuda memcpy (4/4)
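A sketch of the final step, again consistent with the error-checked version shown later:
// free device memory
cudaFree( dev_a );
cudaFree( dev_b );
// print the result
printf("b = {%d,%d,%d,%d,%d}\n", b[0], b[1], b[2], b[3], b[4]);
// done
return 0;
}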
Execution Result
▪ Compile the source code
o nvcc memcpy.cu -o ./memcpy
▪ Executing ./memcpy
a = {1,2,3,4,5}
b = {1,2,3,4,5}
Agenda
Error Checking
Error Checking and Handling in CUDA
▪ It is important for a program to check and handle errors
▪ CUDA API functions return error codes that indicate whether an error has occurred
▪ Example:
o if (cudaMalloc(&devPtr, SIZE) != cudaSuccess) {
exit(1);
}
cudaError_t: data type
▪ typedef enum cudaError cudaError_t
▪ possible values:
o cudaSuccess, cudaErrorMissingConfiguration, cudaErrorMemoryAllocation,
cudaErrorInitializationError, cudaErrorLaunchFailure, cudaErrorLaunchTimeout,
cudaErrorLaunchOutOfResources, cudaErrorInvalidDeviceFunction,
cudaErrorInvalidConfiguration, cudaErrorInvalidDevice, cudaErrorInvalidValue,
cudaErrorInvalidPitchValue, cudaErrorInvalidSymbol, cudaErrorUnmapBufferObjectFailed,
cudaErrorInvalidHostPointer, cudaErrorInvalidDevicePointer, cudaErrorInvalidTexture,
cudaErrorInvalidTextureBinding, cudaErrorInvalidChannelDescriptor,
cudaErrorInvalidMemcpyDirection, cudaErrorInvalidFilterSetting,
cudaErrorInvalidNormSetting, cudaErrorUnknown, cudaErrorNotYetImplemented,
cudaErrorInvalidResourceHandle, cudaErrorInsufficientDriver, cudaErrorSetOnActiveProcess,
cudaErrorStartupFailure, cudaErrorApiFailureBase
cudaGetErrorName( err )
▪ const char* cudaGetErrorName( cudaError_t err )
o err : error code to convert to string
o returns:
• char* to a NULL-terminated string
• NULL if the error code is not valid
cudaGetErrorString( err )
▪ const char* cudaGetErrorString( cudaError_t err )
o err : error code to convert to string
o returns:
• char* to a NULL-terminated string
• NULL if the error code is not valid
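A small sketch combining the two conversion functions (dev_p and nbytes are illustrative):
cudaError_t e = cudaMalloc( (void**)&dev_p, nbytes );
printf( "%s: %s\n", cudaGetErrorName(e), cudaGetErrorString(e) );
// on success this prints "cudaSuccess: no error"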
cudaGetLastError( void )
▪ cudaError_t cudaGetLastError( void );
o returns the last error produced by a CUDA runtime call in the same host thread
o and resets the error state to cudaSuccess
o so, if no CUDA error has occurred since the last call, it returns cudaSuccess
o when multiple errors occur, only the last one is retained
A simple CUDA error check code
cudaMemcpy( … );
cudaError_t e = cudaGetLastError();
if (e != cudaSuccess) {
printf("cuda failure %s:%d: '%s'\n",
__FILE__, __LINE__,
cudaGetErrorString(e) );
exit(0);
}
A simple CUDA error check macro
#define cudaCheckError() do { \
cudaError_t e = cudaGetLastError(); \
if (e != cudaSuccess) { \
printf("cuda failure %s:%d: '%s'\n", \
__FILE__, __LINE__, \
cudaGetErrorString(e) ); \
exit(0); \
}\
} while (0)
Example
▪ code segment (sketched below)
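A minimal usage sketch of the macro, reusing the memcpy example from earlier:
cudaMemcpy( dev_a, a, SIZE * sizeof(int), cudaMemcpyHostToDevice );
cudaCheckError();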
More advanced macro
#ifdef DEBUG // debug mode
#define CUDA_CHECK(x) do {\
(x); \
cudaError_t e = cudaGetLastError(); \
if (cudaSuccess != e) { \
printf("cuda failure %s at %s:%d\n", \
cudaGetErrorString(e), \
__FILE__, __LINE__); \
exit(1); \
}\
} while (0)
#else
#define CUDA_CHECK(x) (x) // release mode
#endif
error_check.cu
#include <cstdio>
#ifdef DEBUG
#define CUDA_CHECK(x) do {\
(x); \
cudaError_t e = cudaGetLastError(); \
if (cudaSuccess != e) { \
printf("cuda failure \"%s\" at %s:%d\n", \
cudaGetErrorString(e), \
__FILE__, __LINE__); \
exit(1); \
}\
} while (0)
#else
#define CUDA_CHECK(x) (x)
#endif
error_check.cu
// main program for the CPU
int main(void) {
// host-side data
const int SIZE = 5;
const int a[SIZE] = { 1, 2, 3, 4, 5 };
int b[SIZE] = { 0, 0, 0, 0, 0 };
// print source
printf("a = {%d,%d,%d,%d,%d}\n", a[0], a[1], a[2], a[3], a[4]);
// device-side data
int *dev_a = 0;
int *dev_b = 0;
// allocate device memory
CUDA_CHECK( cudaMalloc((void**)&dev_a, SIZE * sizeof(int)) );
CUDA_CHECK( cudaMalloc((void**)&dev_b, SIZE * sizeof(int)) );
// copy from host to device
CUDA_CHECK( cudaMemcpy(dev_a, a, SIZE * sizeof(int), cudaMemcpyDeviceToDevice) ); // BOMB here! wrong kind: should be cudaMemcpyHostToDevice
// copy from device to device
CUDA_CHECK( cudaMemcpy(dev_b, dev_a, SIZE * sizeof(int), cudaMemcpyDeviceToDevice) );
// copy from device to host
CUDA_CHECK( cudaMemcpy(b, dev_b, SIZE * sizeof(int), cudaMemcpyDeviceToHost) );
// free device memory
CUDA_CHECK( cudaFree(dev_a) );
CUDA_CHECK( cudaFree(dev_b) );
// print the result
printf("b = {%d,%d,%d,%d,%d}\n", b[0], b[1], b[2], b[3], b[4]);
// done
return 0;
}
Execution Result
▪ Compile the source code
o nvcc error_check.cu -DDEBUG -o ./error_check
o nvcc error_check.cu -o ./error_check
▪ Executing ./error_check
a = {1,2,3,4,5}
b = {………..}
o built with -DDEBUG, the bad copy is caught and the program exits with a "cuda failure" message
o built without -DDEBUG, the error passes silently and b contains whatever happened to be in device memory
Agenda
▪ What is CUDA?
▪ Device Global Memory and Data Transfer
▪ Error Checking
▪ A Vector Addition Kernel
▪ Kernel Functions and Threading
▪ Kernel Launch
CUDA Resources
▪ CUDA API reference:
o http://docs.nvidia.com/cuda/index.html
o http://docs.nvidia.com/cuda/cuda-runtime-api/index.html
▪ CUDA course:
o https://developer.nvidia.com/cuda-education-training
o https://developer.nvidia.com/cuda-training