
Lecture 3: Fundamentals of CUDA (Part 1), 2025

The document provides an overview of CUDA (Compute Unified Device Architecture), a parallel computing platform and API model designed for managing computations on GPUs. It discusses the architecture of the graphics pipeline, memory management, and error checking in CUDA programming. Key features include the ability to program GPUs in C, a scalable thread execution model, and the separation of host and device memory.


SEE3001 Parallel Computer Architecture and Programming

Fundamentals of CUDA 1

Prof. Seokin Hong

Slide Credit: Slides are modified from Prof. Baek’s slides


Agenda
▪ History
▪ What is CUDA?
▪ Device Global Memory and Data Transfer
▪ Error Checking
▪ A Vector Addition Kernel
▪ Kernel Functions and Threading
▪ Kernel Launch

2
Agenda

History

3
3D Graphics Pipeline

4
3D Graphics Pipeline

[Figure: the fixed-function 3D graphics pipeline]
Host (main board): API, Vertex Buffer Objects
GPU (graphics card): Primitive Processing → (vertices) → Transform and Lighting → Primitive Assembly → (triangles/lines/points) → Rasterizer → Texture Environment → Color Sum → Fog → Alpha Test → Depth/Stencil Test → Color Buffer Blend → Dither → Frame Buffer
5
3D Graphics Pipeline

[Figure: the same pipeline, annotated. User input enters through the API on the host; the output of Transform and Lighting is vertex-shaded geometry.]

6
3D Graphics Pipeline

[Figure: the same pipeline, annotated. Vertex-shaded primitives are assembled and passed to the Rasterizer, whose output is rasterized fragments.]

7
3D Graphics Pipeline

[Figure: the same pipeline, annotated. Rasterized fragments pass through the texture, color-sum, and fog stages to produce pixel-shaded output for the frame buffer.]

8
The Graphics Pipeline – 1st Gen.
▪ One chip/board per stage
▪ Fixed data flow through pipeline

[Figure: the same fixed-function pipeline; in the first generation, each stage was a separate chip or board.]

9
The Graphics Pipeline – 2nd Gen.
▪ Everything fixed function, with a certain number of modes
▪ Number of modes for each stage grew over time
▪ Hard to optimize HW
▪ Developers always wanted more flexibility

[Figure: the same fixed-function pipeline, now implemented with configurable modes per stage.]

10
The Graphics Pipeline – 3rd Gen.
▪ Vertex & pixel processing became programmable
▪ GPU architecture increasingly centers around 'shader' execution

[Figure: the programmable pipeline. A Vertex Shader replaces Transform and Lighting, and a Pixel Shader replaces the texture/color stages: API → Primitive Processing → Vertex Shader → Primitive Assembly → Rasterizer → Pixel Shader → Depth/Stencil Test → Color Buffer Blend → Dither → Frame Buffer]

11
Before CUDA

▪ Use the GPU for general-purpose computing by casting problems as graphics
o Turn data into images ("texture maps")
o Turn algorithms into image synthesis ("rendering passes")

▪ Drawback:
o Tough learning curve
o Potentially high overhead of graphics API
o Highly constrained memory layout & access model
Before CUDA
▪ What's wrong with the old GPGPU programming model 1

[Figure: limitations of the pixel-shader (program) model]
o APIs are specific to graphics
o Limited texture size and dimension
o Limited instruction set
o Limited local storage
o No thread communication
o Limited shader outputs

Before CUDA
▪ What's wrong with the old GPGPU programming model 2
Agenda

What is CUDA?

15
What is CUDA?
▪ What is CUDA?: Compute Unified Device Architecture
o Parallel computing platform and application programming interface (API) model
o A powerful parallel programming model for issuing and managing computations
on the GPU without mapping them to a graphics API
▪ Targeted Software stack
o Library, Runtime, Driver
▪ Advantages
o SW: program the GPU in C
• Scalable data parallel execution/memory model
• C with minimal yet powerful extensions
o HW: fully general data-parallel architecture
▪ Features
o Heterogeneous - mixed serial-parallel programming
o Scalable - hierarchical thread execution model
o Accessible - minimal but expressive changes to C
What is CUDA?

[Figure: Pixel Shader (program)]
Review : Heterogeneous Computing
▪ Use more than one kind of processor or core
o CPUs for sequential parts
o GPUs for parallel parts
[Figure: a CPU with its own main memory alongside a GPU with its own video memory.]
18
Simple CUDA Model
▪ Host : CPU + main memory (host memory)
▪ Device : GPU + video memory (device memory)

[Figure: Host = CPU + main memory; Device = GPU + video memory.]
Simple CUDA Model
▪ GNU gcc: Linux C compiler
▪ nvcc: NVIDIA CUDA compiler

[Figure: a CUDA source file (.cu) is split into CPU code (aka host code), compiled by GNU gcc for the host (CPU + main memory), and GPU code (aka device code, or kernel), compiled by nvcc for the device (GPU + video memory).]
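A minimal sketch of how one .cu file carries both kinds of code (the file name, the empty kernel body, and the launch line are illustrative placeholders; kernel launches are covered later in the agenda):

// hello.cu (hypothetical file name): one source file, two compilation targets.
#include <cstdio>

__global__ void emptyKernel(void) {
    // device code (kernel): runs on the GPU; real kernels are covered later in this lecture series
}

int main(void) {
    emptyKernel<<<1, 1>>>();     // kernel launch syntax, shown here only as a preview
    cudaDeviceSynchronize();     // wait until the device has finished
    printf("host code and device code compiled from one .cu file\n");
    return 0;
}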
CUDA Program Execution Scenario
▪ Integrated “host + device” C application
o Serial or modestly parallel parts in host code (serial)
o Highly parallel parts in device code (parallel)

▪ Execution Scenario
o Step 1: host code (serial)
• Serial execution: read data
• Prepare parallel execution
• Copy data from host memory to device memory
o Step 2: device code (kernel, parallel)
• Parallel processing
• Read/write data in device memory
o Step 3: host code (serial)
• Copy data from device memory to host memory
• Serial execution: print data
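A small host-side sketch of this three-step flow; N, readInput, and myKernel are placeholders, not part of the lecture's code:

// Step 1: serial host code reads and prepares data
int *data = (int*) malloc(N * sizeof(int));
readInput(data, N);

int *dev_data = 0;
cudaMalloc((void**)&dev_data, N * sizeof(int));
cudaMemcpy(dev_data, data, N * sizeof(int), cudaMemcpyHostToDevice);

// Step 2: device code (kernel) processes the data in parallel
// myKernel<<<grid, block>>>(dev_data, N);

// Step 3: copy the result back and finish serially on the host
cudaMemcpy(data, dev_data, N * sizeof(int), cudaMemcpyDeviceToHost);
printf("result[0] = %d\n", data[0]);
cudaFree(dev_data);
free(data);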
Agenda

Device Global Memory and Data Transfer

22
CUDA program uses CUDA memory
▪ GPU cores share the “global memory” (device memory)
o DRAM (e.g., GDDR, HBM) is used as global memory
▪ To execute a kernel on a device,
o allocate global memory on the device
o transfer data from the host memory to the allocated device memory
o transfer result data from the device memory back to the host memory
o release global memory

[Figure: the device's streaming multiprocessors, each with its shared memory (SMEM), all connected to global memory; the host CPU and host memory sit on the other side of the bus. Red lines mark global memory.]

23
Memory Spaces (before UVM!!)

▪ CPU and GPU have separate memory spaces


o Data is moved across data bus
o Use functions to allocate/set/copy memory on GPU
o Very similar to corresponding C functions

▪ Pointers are just addresses


o Use pointer to access CPU and GPU memory
o Can’t tell from the pointer value whether the address is on CPU or GPU
o Dereferencing CPU pointer on GPU will likely crash
o Same for vice versa
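A small sketch of that pitfall (pointer names are made up; the pointer values alone do not tell you which side they belong to):

int *host_p = (int*) malloc(sizeof(int));   // address in host (CPU) memory
int *dev_p  = 0;
cudaMalloc((void**)&dev_p, sizeof(int));    // address in device (GPU) memory

*host_p = 42;                               // fine: host pointer dereferenced on the CPU
// *dev_p = 42;                             // WRONG: device pointer dereferenced on the CPU will likely crash
cudaMemcpy(dev_p, host_p, sizeof(int), cudaMemcpyHostToDevice);   // move the value explicitly instead

cudaFree(dev_p);
free(host_p);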

24
CPU Memory Allocation / Release
▪ Host (CPU) manages host (CPU) memory:
o void* malloc (size_t nbytes)
o void* memset (void* pointer, int value, size_t count)
o void free (void* pointer)

int n = 1024;
int nbytes = n * sizeof(int);
int* ptr = 0;
ptr = (int*) malloc( nbytes );
memset( ptr, 0, nbytes );
free( ptr );

25
GPU Memory Allocation / Release
▪ Host (CPU) manages device (GPU) memory:
o cudaMalloc (void** pointer, size_t nbytes)
o cudaMemset (void* pointer, int value, size_t count)
o cudaFree (void* pointer)

int n = 1024;
int nbytes = n * sizeof(int);
int* dev_a = 0;
cudaMalloc( (void**)&dev_a, nbytes );
cudaMemset( dev_a, 0, nbytes);
cudaFree(dev_a);

26
CUDA function rules
▪ Every library function starts with “cuda”

▪ Most of them return an error code (or cudaSuccess).


o cudaError_t cudaMalloc(void** devPtr, size_t size);
o cudaError_t cudaFree(void* devPtr);
o cudaError_t cudaMemcpy(void* dst, const void* src, size_t size, cudaMemcpyKind kind);

▪ Example:
o if (cudaMalloc(&devPtr, SIZE) != cudaSuccess) {
exit(1);
}

27
CUDA Malloc
▪ cudaError_t cudaMalloc( void** devPtr, size_t nbytes );
o allocates nbytes bytes of linear memory on the device
o The start address is stored into “devPtr”
o The memory is not cleared.
o returns cudaSuccess or cudaErrorMemoryAllocation

▪ cudaError_t cudaFree( void* devPtr );


o frees the memory space pointed to by devPtr
o if devPtr == 0, no operation
o returns cudaSuccess or cudaErrorInvalidDevicePointer
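A small sketch of checking the return value before using the allocation (buffer name and size are illustrative):

int* dev_buf = 0;
cudaError_t err = cudaMalloc((void**)&dev_buf, 1024 * sizeof(int));
if (err != cudaSuccess) {          // e.g., cudaErrorMemoryAllocation
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}
// ... use dev_buf on the device ...
cudaFree(dev_buf);                 // cudaFree(0) would simply be a no-op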

28
CUDA Memset
▪ cudaError_t cudaMemset( void* devPtr, int value, size_t nbytes );
o fills the first nbytes bytes of the memory area pointed to by devPtr with the value
o returns cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer
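For example, zero-filling a fresh allocation (a small sketch with illustrative names; cudaMalloc does not clear memory, so cudaMemset does it explicitly):

int* dev_a = 0;
size_t nbytes = 1024 * sizeof(int);
cudaMalloc((void**)&dev_a, nbytes);
cudaMemset(dev_a, 0, nbytes);      // sets each of the first nbytes bytes to 0
cudaFree(dev_a);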

29
Data Copy
▪ cudaError_t cudaMemcpy( void* dst,
const void* src,
size_t nbytes,
enum cudaMemcpyKind direction);
o returns after the copy is complete
o blocks CPU thread until all bytes have been copied
o doesn’t start copying until previous CUDA calls complete

▪ enum cudaMemcpyKind
o cudaMemcpyHostToDevice
o cudaMemcpyDeviceToHost
o cudaMemcpyDeviceToDevice
o cudaMemcpyHostToHost
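A small sketch of all four directions (a/b are host arrays and dev_a/dev_b are device pointers, as in the example on the following slides):

cudaMemcpy(dev_a, a,     nbytes, cudaMemcpyHostToDevice);     // host   -> device
cudaMemcpy(dev_b, dev_a, nbytes, cudaMemcpyDeviceToDevice);   // device -> device
cudaMemcpy(b,     dev_b, nbytes, cudaMemcpyDeviceToHost);     // device -> host
cudaMemcpy(b,     a,     nbytes, cudaMemcpyHostToHost);       // host   -> host (plain memcpy also works)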

30
Data Copy
▪ host → host : memcpy (in C/C++)
▪ host → device, device → device, device → host : cudaMemcpy (CUDA)

[Figure: host code running on the CPU with main memory, the kernel running on the GPU with video memory, and cudaMemcpy moving data between host and device.]

31
Example: Host-Device Mem copy
▪ step 1.
o make a block of data
o print out the source data
▪ step 2.
o copy from host memory to device memory
o copy from device memory to device memory
o copy from device memory to host memory
▪ step 3.
o print out the result
[Figure: the same host/device diagram: host code on the CPU with host memory, the kernel on the GPU with device memory.]

32
Code: cuda memcpy (1/4)

#include <cstdio>      // for printf
#include <iostream>

int main(void) {
// host-side data
const int SIZE = 5;
const int a[SIZE] = { 1, 2, 3, 4, 5 }; // source data
int b[SIZE] = { 0, 0, 0, 0, 0 }; // final destination

// print source
printf("a = {%d,%d,%d,%d,%d}\n", a[0], a[1], a[2], a[3], a[4]);

33
Code: cuda memcpy (2/4)

// device-side data
int* dev_a = 0;
int* dev_b = 0;

// allocate device memory


cudaMalloc((void**)&dev_a, SIZE * sizeof(int));
cudaMalloc((void**)&dev_b, SIZE * sizeof(int));

// copy from host to device


cudaMemcpy(dev_a, a, SIZE * sizeof(int), cudaMemcpyHostToDevice);

34
Code: cuda memcpy (3/4)

// copy from device to device


cudaMemcpy(dev_b, dev_a, SIZE * sizeof(int),
cudaMemcpyDeviceToDevice);
// copy from device to host
cudaMemcpy(b, dev_b, SIZE * sizeof(int), cudaMemcpyDeviceToHost);

35
Code: cuda memcpy (4/4)

// free device memory


cudaFree(dev_a);
cudaFree(dev_b);
// print the result
printf("b = {%d,%d,%d,%d,%d}\n", b[0], b[1], b[2], b[3], b[4]);
// done
return 0;
}

36
Execution Result
▪ Compile the source code
o nvcc memcpy.cu -o ./memcpy

▪ executing ./memcpy
a = {1,2,3,4,5}
b = {1,2,3,4,5}

37
Agenda

Error Checking

38
Error Checking and Handling in CUDA
▪ It is important for a program to check and handle errors
▪ CUDA API functions return flags that indicate whether an error has
occurred

▪ Most of them return an error code (or cudaSuccess).


o cudaError_t cudaMalloc(void** devPtr, size_t size);
o cudaError_t cudaFree(void* devPtr);
o cudaError_t cudaMemcpy(void* dst, const void* src, size_t size, cudaMemcpyKind kind);

▪ Example:
o if (cudaMalloc(&devPtr, SIZE) != cudaSuccess) {
exit(1);
}

39
cudaError_t : data type
▪ typedef enum cudaError cudaError_t
▪ possible values:
o cudaSuccess, cudaErrorMissingConfiguration, cudaErrorMemoryAllocation,
cudaErrorInitializationError, cudaErrorLaunchFailure, cudaErrorLaunchTimeout,
cudaErrorLaunchOutOfResources, cudaErrorInvalidDeviceFunction,
cudaErrorInvalidConfiguration, cudaErrorInvalidDevice, cudaErrorInvalidValue,
cudaErrorInvalidPitchValue, cudaErrorInvalidSymbol, cudaErrorUnmapBufferObjectFailed,
cudaErrorInvalidHostPointer, cudaErrorInvalidDevicePointer, cudaErrorInvalidTexture,
cudaErrorInvalidTextureBinding, cudaErrorInvalidChannelDescriptor,
cudaErrorInvalidMemcpyDirection, cudaErrorInvalidFilterSetting,
cudaErrorInvalidNormSetting, cudaErrorUnknown, cudaErrorNotYetImplemented,
cudaErrorInvalidResourceHandle, cudaErrorInsufficientDriver, cudaErrorSetOnActiveProcess,
cudaErrorStartupFailure, cudaErrorApiFailureBase

40
cudaGetErrorName( err )
▪ const char* cudaGetErrorName( cudaError_t err )
o err : error code to convert to string
o returns:
• char* to a NULL-terminated string
• NULL if the error code is not valid

▪ cout << cudaGetErrorName( cudaErrorMemoryAllocation ) << endl;
▪ cout << cudaGetErrorName( cudaErrorInvalidValue ) << endl;
o shows:
cudaErrorMemoryAllocation
cudaErrorInvalidValue

41
cudaGetErrorString( err )
▪ const char* cudaGetErrorString( cudaError_t err )
o err : error code to convert to string
o returns:
• char* to a NULL-terminated string
• NULL if the error code is not valid

▪ cout << cudaGetErrorString( cudaErrorMemoryAllocation ) << endl;
▪ cout << cudaGetErrorString( cudaErrorInvalidValue ) << endl;
o shows:
out of memory
invalid argument

42
cudaGetLastError( void )
▪ cudaError_t cudaGetLastError( void)
o returns the last error due to CUDA runtime calls in the same host thread
o and resets it to cudaSuccess
o So, if no CUDA error since the last call, it returns cudaSuccess
o For multiple errors, it contains the last error only.

▪ cudaError_t cudaPeekAtLastError( void )


o returns the last error due to CUDA runtime calls in the same host thread
o Note that this call does NOT reset
o So, the last error code is still available
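A small sketch of the difference (assuming some earlier CUDA call in this host thread has failed):

cudaError_t e1 = cudaPeekAtLastError();   // reports the error, leaves it in place
cudaError_t e2 = cudaGetLastError();      // reports the error, then resets it to cudaSuccess
cudaError_t e3 = cudaGetLastError();      // now returns cudaSuccess: the error was consumed
printf("%s / %s / %s\n", cudaGetErrorName(e1), cudaGetErrorName(e2), cudaGetErrorName(e3));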

43
A simple CUDA error check code
cudaMemcpy( … );
cudaError_t e = cudaGetLastError();
if (e != cudaSuccess) {
printf("cuda failure %s:%d: '%s'\n",
__FILE__, __LINE__,
cudaGetErrorString(e) );
exit(0);
}

44
A simple CUDA error check macro
#define cudaCheckError( ) do { \
cudaError_t e = cudaGetLastError(); \
if (e != cudaSuccess) { \
printf("cuda failure %s:%d: '%s'\n", \
__FILE__, __LINE__, \
cudaGetErrorString(e) ); \
exit(0); \
}\
} while (0)

45
Example
▪ code segment

// allocate device memory


cudaMalloc((void**)&dev_a, sizeof(int));
cudaMalloc((void**)&dev_b, sizeof(int));
cudaMalloc((void**)&dev_c, sizeof(int));
cudaCheckError( );

46
More advanced macro
#ifdef DEBUG // debug mode
#define CUDA_CHECK(x) do {\
(x); \
cudaError_t e = cudaGetLastError(); \
if (cudaSuccess != e) { \
printf("cuda failure %s at %s:%d\n", \
cudaGetErrorString(e), \
__FILE__, __LINE__); \
exit(1); \
}\
} while (0)
#else
#define CUDA_CHECK(x) (x) // release mode
#endif

47
error_check.cu
#include <cstdio>      // for printf
#include <iostream>

#ifdef DEBUG
#define CUDA_CHECK(x) do {\
(x); \
cudaError_t e = cudaGetLastError(); \
if (cudaSuccess != e) { \
printf("cuda failure \"%s\" at %s:%d\n", \
cudaGetErrorString(e), \
__FILE__, __LINE__); \
exit(1); \
}\
} while (0)
#else
#define CUDA_CHECK(x) (x)
#endif

48
error_check.cu
// main program for the CPU
int main(void) {
// host-side data
const int SIZE = 5;
const int a[SIZE] = { 1, 2, 3, 4, 5 };
int b[SIZE] = { 0, 0, 0, 0, 0 };
// print source
printf("a = {%d,%d,%d,%d,%d}\n", a[0], a[1], a[2], a[3], a[4]);
// device-side data
int *dev_a = 0;
int *dev_b = 0;
// allocate device memory
CUDA_CHECK( cudaMalloc((void**)&dev_a, SIZE * sizeof(int)) );
CUDA_CHECK( cudaMalloc((void**)&dev_b, SIZE * sizeof(int)) );
// copy from host to device
CUDA_CHECK( cudaMemcpy(dev_a, a, SIZE * sizeof(int), cudaMemcpyDeviceToDevice) ); // BOMB here !
// copy from device to device
CUDA_CHECK( cudaMemcpy(dev_b, dev_a, SIZE * sizeof(int), cudaMemcpyDeviceToDevice) );
// copy from device to host
CUDA_CHECK( cudaMemcpy(b, dev_b, SIZE * sizeof(int), cudaMemcpyDeviceToHost) );
// free device memory
CUDA_CHECK( cudaFree(dev_a) );
CUDA_CHECK( cudaFree(dev_b) );
// print the result
printf("b = {%d,%d,%d,%d,%d}\n", b[0], b[1], b[2], b[3], b[4]);
// done
return 0;
}

49
Execution Result
▪ Compile the source code
o nvcc error_check.cu -DDEBUG -o ./error_check
o nvcc error_check.cu -o ./error_check

▪ executing ./error_check
a = {1,2,3,4,5}
b = {………..}

50
Agenda
▪ What is CUDA?
▪ Device Global Memory and Data Transfer
▪ Error Checking
▪ A Vector Addition Kernel
▪ Kernel Functions and Threading
▪ Kernel Launch

51
CUDA Resources
▪ CUDA API reference:
o http://docs.nvidia.com/cuda/index.html
o http://docs.nvidia.com/cuda/cuda-runtime-api/index.html
▪ CUDA course:
o https://developer.nvidia.com/cuda-education-training
o https://developer.nvidia.com/cuda-training

52
