CUDA

CUDA is a parallel computing platform and API developed by Nvidia that enables software to utilize GPUs for general-purpose processing, enhancing performance in various computational tasks. It supports programming languages like C, C++, Fortran, and Python, and includes libraries and tools for developers to optimize their applications. CUDA has advantages over traditional GPGPU methods, but also has limitations such as interoperability issues and performance constraints related to memory transfers.
In computing, CUDA is a proprietary[2] parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs. CUDA was created by Nvidia in 2006.[3] When it was first introduced, the name was an acronym for Compute Unified Device Architecture,[4] but Nvidia later dropped the common use of the acronym and now rarely expands it.[5]

CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels.[6] In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools to help programmers accelerate their applications.

CUDA is designed to work with programming languages such as C, C++, Fortran and Python. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which require advanced skills in graphics programming.[7] CUDA-powered GPUs also support programming frameworks such as OpenMP, OpenACC and OpenCL.[8][6]

Developer(s): Nvidia
Initial release: February 16, 2007[1]
Stable release: 12.8 / January 2025
Operating system: Windows, Linux
Platform: Supported GPUs
Type: GPGPU
License: Proprietary
Website: developer.nvidia.com/cuda-zone

Background
The graphics processing unit (GPU), as a specialized computer processor, addresses the demands of real-time high-resolution 3D graphics compute-intensive tasks. By 2012, GPUs had evolved into highly parallel multi-core systems allowing efficient manipulation of large blocks of data. This design is more effective than general-purpose central processing units (CPUs) for algorithms in situations where processing large blocks of data is done in parallel, such as:

cryptographic hash functions


machine learning
molecular dynamics simulations
physics engines
Ian Buck, while at Stanford in 2000, created an 8K gaming rig using 32 GeForce cards, then obtained a DARPA grant to perform general purpose parallel
programming on GPUs. He then joined Nvidia, where since 2004 he has been overseeing CUDA development. In pushing for CUDA, Jensen Huang aimed for
the Nvidia GPUs to become a general hardware for scientific computing. CUDA was released in 2007. Around 2015, the focus of CUDA changed to neural
networks.[9]

Ontology
The following table offers a non-exact description of the ontology of the CUDA framework.

The ontology of the CUDA framework

memory (hardware) | memory (code, or variable scoping) | computation (hardware) | computation (code syntax) | computation (code semantics)
RAM | non-CUDA variables | host | program | one routine call
VRAM, GPU L2 cache | global, const, texture | device | grid | simultaneous call of the same subroutine on many processors
GPU L1 cache | local, shared | SM ("streaming multiprocessor") | block | individual subroutine call
 | | warp = 32 threads | | SIMD instructions
GPU L0 cache, register | | thread (aka "SP", "streaming processor", "cuda core", but these names are now deprecated) | | analogous to individual scalar ops within a vector op
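To make the thread/warp relationship in the table concrete, here is a minimal host-side sketch in plain Python (no GPU required). It assumes the usual row-major linearisation of a thread's (x, y, z) index within its block and the warp size of 32 given above; the function names are illustrative and not part of any CUDA API.

```python
WARP_SIZE = 32  # from the table: warp = 32 threads


def linear_thread_id(thread_idx, block_dim):
    """Flatten a 3D thread index within a block (x varies fastest)."""
    x, y, z = thread_idx
    dx, dy, _ = block_dim
    return x + y * dx + z * dx * dy


def warp_and_lane(thread_idx, block_dim):
    """Return (warp number within the block, position within that warp)."""
    return divmod(linear_thread_id(thread_idx, block_dim), WARP_SIZE)


# In a 16 x 16 block, thread (1, 2, 0) has linear id 33,
# so it is the second thread of the second warp.
print(warp_and_lane((1, 2, 0), (16, 16, 1)))
```

Threads that land in the same warp are the ones that execute together as one SIMD instruction stream.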

Programming abilities
The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives such as OpenACC, and extensions to
industry-standard programming languages including C, C++, Fortran and Python. C/C++ programmers can use 'CUDA C/C++', compiled to PTX with nvcc,
Nvidia's LLVM-based C/C++ compiler, or by clang itself.[10] Fortran programmers can use 'CUDA Fortran', compiled with the PGI CUDA Fortran compiler
from The Portland Group. Python programmers can use the cuNumeric library to accelerate applications on Nvidia GPUs.

In addition to libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA platform supports other computational interfaces, including the
Khronos Group's OpenCL,[11] Microsoft's DirectCompute, OpenGL Compute Shader and C++ AMP.[12] Third party wrappers are also available for Python,
Perl, Fortran, Java, Ruby, Lua, Common Lisp, Haskell, R, MATLAB, IDL, Julia, and native support in Mathematica.
In the computer game industry, GPUs are used for graphics rendering, and for game physics calculations
(physical effects such as debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also
been used to accelerate non-graphical applications in computational biology, cryptography and other fields
by an order of magnitude or more.[13][14][15][16][17]

CUDA provides both a low level API (CUDA Driver API, non single-source) and a higher level API
(CUDA Runtime API, single-source). The initial CUDA SDK was made public on 15 February 2007, for
Microsoft Windows and Linux. Mac OS X support was later added in version 2.0,[18] which supersedes the
beta released February 14, 2008.[19] CUDA works with all Nvidia GPUs from the G8x series onwards,
including GeForce, Quadro and the Tesla line. CUDA is compatible with most standard operating systems.

CUDA 8.0 comes with the following libraries (for compilation & runtime, in alphabetical order):

cuBLAS – CUDA Basic Linear Algebra Subroutines library
CUDART – CUDA Runtime library
cuFFT – CUDA Fast Fourier Transform library
cuRAND – CUDA Random Number Generation library
cuSOLVER – CUDA based collection of dense and sparse direct solvers
cuSPARSE – CUDA Sparse Matrix library
NPP – NVIDIA Performance Primitives library
nvGRAPH – NVIDIA Graph Analytics library
NVML – NVIDIA Management Library
NVRTC – NVIDIA Runtime Compilation library for CUDA C++

Figure: Example of CUDA processing flow
1. Copy data from main memory to GPU memory
2. CPU initiates the GPU compute kernel
3. GPU's CUDA cores execute the kernel in parallel
4. Copy the resulting data from GPU memory to main memory

CUDA 8.0 comes with these other software components:

nView – NVIDIA nView Desktop Management Software
NVWMI – NVIDIA Enterprise Management Toolkit
GameWorks PhysX – a multi-platform game physics engine

CUDA 9.0–9.2 comes with these other components:

CUTLASS 1.0 – custom linear algebra algorithms
NVIDIA Video Decoder was deprecated in CUDA 9.2; it is now available in NVIDIA Video Codec SDK

CUDA 10 comes with these other components:

nvJPEG – Hybrid (CPU and GPU) JPEG processing

CUDA 11.0–11.8 comes with these other components:[20][21][22][23]

CUB – now one of the supported C++ libraries
MIG – Multi-Instance GPU support
nvJPEG2000 – JPEG 2000 encoder and decoder

Advantages
CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:

Scattered reads – code can read from arbitrary addresses in memory.


Unified virtual memory (CUDA 4.0 and above)
Unified memory (CUDA 6.0 and above)
Shared memory – CUDA exposes a fast shared memory region that can be shared among threads. This can be used as a user-managed
cache, enabling higher bandwidth than is possible using texture lookups.[24]
Faster downloads and readbacks to and from the GPU
Full support for integer and bitwise operations, including integer texture lookups

Limitations
Whether for the host computer or the GPU device, all CUDA source code is now processed according to C++ syntax rules.[25] This was not
always the case. Earlier versions of CUDA were based on C syntax rules.[26] As with the more general case of compiling C code with a C++
compiler, it is therefore possible that old C-style CUDA source code will either fail to compile or will not behave as originally intended.
Interoperability with rendering languages such as OpenGL is one-way, with OpenGL having access to registered CUDA memory but CUDA
not having access to OpenGL memory.
Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency (this can be partly
alleviated with asynchronous memory transfers, handled by the GPU's DMA engine).
Threads should be running in groups of at least 32 for best performance, with total number of threads numbering in the thousands. Branches
in the program code do not affect performance significantly, provided that each of 32 threads takes the same execution path; the SIMD
execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a space partitioning data structure during
ray tracing).
No emulation or fallback functionality is available for modern revisions.
Valid C++ may sometimes be flagged and prevent compilation due to the way the compiler approaches optimization for target GPU device
limitations.
C++ run-time type information (RTTI) and C++-style exception handling are only supported in host code, not in device code.
In single-precision on first generation CUDA compute capability 1.x devices, denormal numbers are unsupported and are instead flushed to
zero, and the precision of both the division and square root operations are slightly lower than IEEE 754-compliant single precision math.
Devices that support compute capability 2.0 and above support denormal numbers, and the division and square root operations are IEEE 754
compliant by default. However, users can obtain the prior faster gaming-grade math of compute capability 1.x devices if desired by setting
compiler flags to disable accurate divisions and accurate square roots, and enable flushing denormal numbers to zero.[27]
Unlike OpenCL, CUDA-enabled GPUs are only available from Nvidia as it is proprietary.[28][2] Attempts to implement CUDA on other GPUs
include:

Project Coriander: Converts CUDA C++11 source to OpenCL 1.2 C. A fork of CUDA-on-CL intended to run TensorFlow.[29][30][31]
CU2CL: Convert CUDA 3.2 C++ to OpenCL C.[32]
GPUOpen HIP: A thin abstraction layer on top of CUDA and ROCm intended for AMD and Nvidia GPUs. Has a conversion tool for
importing CUDA C++ source. Supports CUDA 4.0 plus C++11 and float16.
ZLUDA is a drop-in replacement for CUDA on AMD GPUs and formerly Intel GPUs with near-native performance.[33] The developer,
Andrzej Janik, was separately contracted by both Intel and AMD to develop the software in 2021 and 2022, respectively. However, neither
company decided to release it officially due to the lack of a business use case. AMD's contract included a clause that allowed Janik to
release his code for AMD independently, allowing him to release the new version that only supports AMD GPUs.[34]
chipStar can compile and run CUDA/HIP programs on advanced OpenCL 3.0 or Level Zero platforms.[35]
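
The limitation on warps and branching listed above can be illustrated with a small host-side model in plain Python. This is only a sketch under the assumption that threads are grouped into warps of 32 consecutive indices; it counts how many warps a launch needs and how many of them contain threads that disagree on a branch (the divergent case the text warns about). The function names are hypothetical.

```python
WARP_SIZE = 32  # group size named in the limitation above


def warps_needed(num_threads):
    """Number of warps occupied by num_threads (the last may be partial)."""
    return (num_threads + WARP_SIZE - 1) // WARP_SIZE


def divergent_warps(branch_taken):
    """Count warps in which some threads take a branch and others do not.

    branch_taken[i] is True if thread i takes the branch.
    """
    count = 0
    for start in range(0, len(branch_taken), WARP_SIZE):
        warp = branch_taken[start:start + WARP_SIZE]
        if any(warp) and not all(warp):
            count += 1
    return count


# Branching on even/odd thread index splits every warp ...
print(divergent_warps([i % 2 == 0 for i in range(64)]))
# ... while branching on whole-warp boundaries splits none.
print(divergent_warps([i < 32 for i in range(64)]))
```

Arranging work so that the branch condition is uniform within each group of 32 threads is the usual way to stay in the second case.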

Example
This example code in C++ loads a texture from an image into an array on the GPU:

// Note: this uses the legacy texture reference API, which was removed in CUDA 12.
texture<float, 2, cudaReadModeElementType> tex;

// Forward declaration so that foo() can launch the kernel defined below
__global__ void kernel(float* odata, int height, int width);

// image: host pixels (width*height floats); d_data: device output buffer
void foo(const float* image, float* d_data, int width, int height)
{
    cudaArray* cu_array;

    // Allocate array
    cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
    cudaMallocArray(&cu_array, &description, width, height);

    // Copy image data to array (the two zeros are the x and y offsets into the array)
    cudaMemcpyToArray(cu_array, 0, 0, image, width*height*sizeof(float), cudaMemcpyHostToDevice);

    // Set texture parameters (default)
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;
    tex.normalized = false; // do not normalize coordinates

    // Bind the array to the texture
    cudaBindTextureToArray(tex, cu_array);

    // Run kernel, rounding the grid size up so that every pixel is covered
    dim3 blockDim(16, 16, 1);
    dim3 gridDim((width + blockDim.x - 1) / blockDim.x, (height + blockDim.y - 1) / blockDim.y, 1);
    kernel<<< gridDim, blockDim, 0 >>>(d_data, height, width);

    // Unbind the array from the texture
    cudaUnbindTexture(tex);
} //end foo()

__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    // Threads in the rounded-up part of the grid fall outside the image and do nothing
    if (x < width && y < height) {
        float c = tex2D(tex, x, y);
        odata[y*width+x] = c;
    }
}
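
The grid-size arithmetic in the launch above (round the image size up to a whole number of 16 x 16 blocks, then guard with a bounds check in the kernel) can be checked on the host without a GPU. The following plain-Python sketch mirrors that arithmetic; the function names are illustrative only.

```python
def grid_dims(width, height, block=(16, 16)):
    """Blocks needed per axis, using the same ceiling division as the launch."""
    bx, by = block
    return ((width + bx - 1) // bx, (height + by - 1) // by)


def covered_pixels(width, height, block=(16, 16)):
    """Count launched threads that pass the kernel's bounds check."""
    gx, gy = grid_dims(width, height, block)
    bx, by = block
    count = 0
    for y in range(gy * by):
        for x in range(gx * bx):
            if x < width and y < height:  # same guard as in the kernel
                count += 1
    return count


# A 100 x 50 image needs 7 x 4 blocks; exactly 100 * 50 threads do work.
print(grid_dims(100, 50), covered_pixels(100, 50))
```

Without the rounding up, the rightmost and bottom pixels of any image whose size is not a multiple of 16 would never be processed; without the guard, the surplus threads would write out of bounds.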

Below is an example given in Python that computes the product of two arrays on the GPU. The unofficial Python language bindings can be obtained from
PyCUDA.[36]

import pycuda.compiler as comp
import pycuda.driver as drv
import numpy
import pycuda.autoinit

mod = comp.SourceModule(
    """
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
"""
)

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1))

print(dest - a * b)

Additional Python bindings to simplify matrix multiplication operations can be found in the program pycublas.[37]

import numpy
from pycublas import CUBLASMatrix

A = CUBLASMatrix(numpy.mat([[1, 2, 3], [4, 5, 6]], numpy.float32))
B = CUBLASMatrix(numpy.mat([[2, 3], [4, 5], [6, 7]], numpy.float32))
C = A * B
print(C.np_mat())

CuPy, by contrast, acts as a direct replacement for NumPy:[38]

import cupy

a = cupy.random.randn(400)
b = cupy.random.randn(400)

dest = a * b  # element-wise product, computed on the GPU

print(dest - a * b)

GPUs supported
Supported CUDA compute capability versions for CUDA SDK version and microarchitecture (by code name):

Compute capability (CUDA SDK support vs. microarchitecture)

Microarchitecture columns, in order: Tesla, Fermi, Kepler (early), Kepler (late), Maxwell, Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper, Blackwell.

CUDA SDK version(s) | Compute capabilities listed
1.0[39] | 1.0 – 1.1
1.1 | 1.0 – 1.1+x
2.0 | 1.0 – 1.1+x
2.1 – 2.3.1[40][41][42][43] | 1.0 – 1.3
3.0 – 3.1[44][45] | 1.0 – 2.0
3.2[46] | 1.0 – 2.1
4.0 – 4.2 | 1.0 – 2.1
5.0 – 5.5 | 1.0 – 3.5
6.0 | 1.0 – 3.5 (3.2 also listed)
6.5 | 1.1 – 5.x (3.7 also listed)
7.0 – 7.5 | 2.0 – 5.x
8.0 | 2.0 – 6.x
9.0 – 9.2 | 3.0 – 7.2 (listed as 3.0 and 7.0 – 7.2)
10.0 – 10.2 | 3.0 – 7.5
11.0[47] | 3.5 – 8.0
11.1 – 11.4[48] | 3.5 – 8.6
11.5 – 11.7.1[49] | 3.5 – 8.7
11.8[50] | 3.5 – 9.0 (8.9 also listed)
12.0 – 12.6 | 5.0 – 9.0
12.8 | 5.0 – 12.0

Note: CUDA SDK 10.2 is the last official release for macOS, as support will not be available for macOS in newer releases.
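
A table like the one above is often consulted programmatically, for example to decide whether an installed toolkit can target a given GPU. The plain-Python sketch below transcribes a few of its rows and treats each row as one contiguous range from the lowest to the highest listed compute capability; that reading is an assumption, and the names `SUPPORT` and `sdk_supports` are hypothetical helpers, not part of any Nvidia tool.

```python
# (first SDK, last SDK, lowest cc, highest cc), transcribed from the table above.
# Only a subset of rows is included; each row is assumed to be a contiguous range.
SUPPORT = [
    ((9, 0), (9, 2), (3, 0), (7, 2)),
    ((10, 0), (10, 2), (3, 0), (7, 5)),
    ((11, 0), (11, 0), (3, 5), (8, 0)),
    ((11, 8), (11, 8), (3, 5), (9, 0)),
    ((12, 0), (12, 6), (5, 0), (9, 0)),
    ((12, 8), (12, 8), (5, 0), (12, 0)),
]


def parse(version):
    """Turn a dotted version string such as '12.4' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def sdk_supports(sdk, compute_capability):
    """True/False if the SDK row is known, None if it is not in SUPPORT."""
    s = parse(sdk)[:2]
    cc = parse(compute_capability)
    for first, last, lowest, highest in SUPPORT:
        if first <= s <= last:
            return lowest <= cc <= highest
    return None


print(sdk_supports("12.4", "5.0"))  # Maxwell is still in range for 12.x
print(sdk_supports("12.0", "3.5"))  # Kepler is below the 12.x lower bound
```

Returning `None` for unknown SDK versions keeps "not supported" distinct from "not in this abbreviated table".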

CUDA compute capability by version with associated GPU semiconductors and GPU card models (separated by their various application areas):
Compute capability, GPU semiconductors and Nvidia GPU board products
Columns: Compute capability (version) | Micro-architecture | GPUs | GeForce | Quadro, NVS | Tesla/Datacenter | Tegra, Jetson, DRIVE

Compute capability 1.0 (micro-architecture: Tesla; GPUs: G80)
GeForce: GeForce 8800 Ultra, GeForce 8800 GTX, GeForce 8800 GTS(G80)
Quadro, NVS: Quadro FX 5600, Quadro FX 4600, Quadro Plex 2100 S4
Tesla/Datacenter: Tesla C870, Tesla D870, Tesla S870

Compute capability 1.1 (micro-architecture: Tesla; GPUs: G92, G94, G96, G98, G84, G86)
GeForce: GeForce GTS 250, GeForce 9800 GX2, GeForce 9800 GTX, GeForce 9800 GT, GeForce 8800 GTS(G92), GeForce 8800 GT, GeForce 9600 GT, GeForce 9500 GT, GeForce 9400 GT, GeForce 8600 GTS, GeForce 8600 GT, GeForce 8500 GT, GeForce G110M, GeForce 9300M GS, GeForce 9200M GS, GeForce 9100M G, GeForce 8400M GT, GeForce G105M
Quadro, NVS: Quadro FX 4700 X2, Quadro FX 3700, Quadro FX 1800, Quadro FX 1700, Quadro FX 580, Quadro FX 570, Quadro FX 470, Quadro FX 380, Quadro FX 370, Quadro FX 370 Low Profile, Quadro NVS 450, Quadro NVS 420, Quadro NVS 290, Quadro NVS 295, Quadro Plex 2100 D4, Quadro FX 3800M, Quadro FX 3700M, Quadro FX 3600M, Quadro FX 2800M, Quadro FX 2700M, Quadro FX 1700M, Quadro FX 1600M, Quadro FX 770M, Quadro FX 570M, Quadro FX 370M, Quadro FX 360M, Quadro NVS 320M, Quadro NVS 160M, Quadro NVS 150M, Quadro NVS 140M, Quadro NVS 135M, Quadro NVS 130M, Quadro NVS 450, Quadro NVS 420,[51] Quadro NVS 295

Compute capability 1.2 (micro-architecture: Tesla; GPUs: GT218, GT216, GT215)
GeForce: GeForce GT 340*, GeForce GT 330*, GeForce GT 320*, GeForce 315*, GeForce 310*, GeForce GT 240, GeForce GT 220, GeForce 210, GeForce GTS 360M, GeForce GTS 350M, GeForce GT 335M, GeForce GT 330M, GeForce GT 325M, GeForce GT 240M, GeForce G210M, GeForce 310M, GeForce 305M
Quadro, NVS: Quadro FX 380 Low Profile, Quadro FX 1800M, Quadro FX 880M, Quadro FX 380M, Nvidia NVS 300, NVS 5100M, NVS 3100M, NVS 2100M, ION

Compute capability 1.3 (micro-architecture: Tesla; GPUs: GT200, GT200b)
GeForce: GeForce GTX 295, GTX 285, GTX 280, GeForce GTX 275, GeForce GTX 260
Quadro, NVS: Quadro FX 5800, Quadro FX 4800, Quadro FX 4800 for Mac, Quadro FX 3800, Quadro CX, Quadro Plex 2200 D2
Tesla/Datacenter: Tesla C1060, Tesla S1070, Tesla M1060

Compute capability 2.0 (micro-architecture: Fermi; GPUs: GF100, GF110)
GeForce: GeForce GTX 590, GeForce GTX 580, GeForce GTX 570, GeForce GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 480M
Quadro, NVS: Quadro 6000, Quadro 5000, Quadro 4000, Quadro 4000 for Mac, Quadro Plex 7000, Quadro 5010M, Quadro 5000M
Tesla/Datacenter: Tesla C2075, Tesla C2050/C2070, Tesla M2050/M2070/M2075/M2090

Compute capability 2.1 (micro-architecture: Fermi; GPUs: GF104, GF106, GF108, GF114, GF116, GF117, GF119)
GeForce: GeForce GTX 560 Ti, GeForce GTX 550 Ti, GeForce GTX 460, GeForce GTS 450, GeForce GTS 450*, GeForce GT 640 (GDDR3), GeForce GT 630, GeForce GT 620, GeForce GT 610, GeForce GT 520, GeForce GT 440, GeForce GT 440*, GeForce GT 430, GeForce GT 430*, GeForce GT 420*, GeForce GTX 675M, GeForce GTX 670M, GeForce GT 635M, GeForce GT 630M, GeForce GT 625M, GeForce GT 720M, GeForce GT 620M, GeForce 710M, GeForce 610M, GeForce 820M, GeForce GTX 580M, GeForce GTX 570M, GeForce GTX 560M, GeForce GT 555M, GeForce GT 550M, GeForce GT 540M, GeForce GT 525M, GeForce GT 520MX, GeForce GT 520M, GeForce GTX 485M, GeForce GTX 470M, GeForce GTX 460M, GeForce GT 445M, GeForce GT 435M, GeForce GT 420M, GeForce GT 415M, GeForce 710M, GeForce 410M
Quadro, NVS: Quadro 2000, Quadro 2000D, Quadro 600, Quadro 4000M, Quadro 3000M, Quadro 2000M, Quadro 1000M, NVS 310, NVS 315, NVS 5400M, NVS 5200M, NVS 4200M

Compute capability 3.0 (micro-architecture: Kepler; GPUs: GK104, GK106, GK107)
GeForce: GeForce GTX 770, GeForce GTX 760, GeForce GT 740, GeForce GTX 690, GeForce GTX 680, GeForce GTX 670, GeForce GTX 660 Ti, GeForce GTX 660, GeForce GTX 650 Ti BOOST, GeForce GTX 650 Ti, GeForce GTX 650, GeForce GTX 880M, GeForce GTX 870M, GeForce GTX 780M, GeForce GTX 770M, GeForce GTX 765M, GeForce GTX 760M, GeForce GTX 680MX, GeForce GTX 680M, GeForce GTX 675MX, GeForce GTX 670MX, GeForce GTX 660M, GeForce GT 750M, GeForce GT 650M, GeForce GT 745M, GeForce GT 645M, GeForce GT 740M, GeForce GT 730M, GeForce GT 640M, GeForce GT 640M LE, GeForce GT 735M, GeForce GT 730M
Quadro, NVS: Quadro K5000, Quadro K4200, Quadro K4000, Quadro K2000, Quadro K2000D, Quadro K600, Quadro K420, Quadro K500M, Quadro K510M, Quadro K610M, Quadro K1000M, Quadro K2000M, Quadro K1100M, Quadro K2100M, Quadro K3000M, Quadro K3100M, Quadro K4000M, Quadro K5000M, Quadro K4100M, Quadro K5100M, NVS 510, Quadro 410
Tesla/Datacenter: Tesla K10, GRID K340, GRID K520, GRID K2

Compute capability 3.2 (micro-architecture: Kepler; GPUs: GK20A)
Tegra, Jetson, DRIVE: Tegra K1, Jetson TK1

Compute capability 3.5 (micro-architecture: Kepler; GPUs: GK110, GK208)
GeForce: GeForce GTX Titan Z, GeForce GTX Titan Black, GeForce GTX Titan, GeForce GTX 780 Ti, GeForce GTX 780, GeForce GT 640 (GDDR5), GeForce GT 630 v2, GeForce GT 730, GeForce GT 720, GeForce GT 710, GeForce GT 740M (64-bit, DDR3), GeForce GT 920M
Quadro, NVS: Quadro K6000, Quadro K5200
Tesla/Datacenter: Tesla K40, Tesla K20x, Tesla K20

Compute capability 3.7 (micro-architecture: Kepler; GPUs: GK210)
Tesla/Datacenter: Tesla K80

Compute capability 5.0 (micro-architecture: Maxwell; GPUs: GM107, GM108)
GeForce: GeForce GTX 750 Ti, GeForce GTX 750, GeForce GTX 960M, GeForce GTX 950M, GeForce 940M, GeForce 930M, GeForce GTX 860M, GeForce GTX 850M, GeForce 845M, GeForce 840M, GeForce 830M
Quadro, NVS: Quadro K1200, Quadro K2200, Quadro K620, Quadro M2000M, Quadro M1000M, Quadro M600M, Quadro K620M, NVS 810
Tesla/Datacenter: Tesla M10

Compute capability 5.2 (micro-architecture: Maxwell; GPUs: GM200, GM204, GM206)
GeForce: GeForce GTX Titan X, GeForce GTX 980 Ti, GeForce GTX 980, GeForce GTX 970, GeForce GTX 960, GeForce GTX 950, GeForce GTX 750 SE, GeForce GTX 980M, GeForce GTX 970M, GeForce GTX 965M
Quadro, NVS: Quadro M6000 24GB, Quadro M6000, Quadro M5000, Quadro M4000, Quadro M2000, Quadro M5500, Quadro M5000M, Quadro M4000M, Quadro M3000M
Tesla/Datacenter: Tesla M4, Tesla M40, Tesla M6, Tesla M60

Compute capability 5.3 (micro-architecture: Maxwell; GPUs: GM20B)
Tegra, Jetson, DRIVE: Tegra X1, Jetson TX1, Jetson Nano, DRIVE CX, DRIVE PX

Compute capability 6.0 (micro-architecture: Pascal; GPUs: GP100)
Quadro, NVS: Quadro GP100
Tesla/Datacenter: Tesla P100

Compute capability 6.1 (micro-architecture: Pascal; GPUs: GP102, GP104, GP106, GP107, GP108)
GeForce: Nvidia TITAN Xp, Titan X, GeForce GTX 1080 Ti, GTX 1080, GTX 1070 Ti, GTX 1070, GTX 1060, GTX 1050 Ti, GTX 1050, GT 1030, GT 1010, MX350, MX330, MX250, MX230, MX150, MX130, MX110
Quadro, NVS: Quadro P6000, Quadro P5000, Quadro P4000, Quadro P2200, Quadro P2000, Quadro P1000, Quadro P400, Quadro P500, Quadro P520, Quadro P600, Quadro P5000 (mobile), Quadro P4000 (mobile), Quadro P3000 (mobile)
Tesla/Datacenter: Tesla P40, Tesla P6, Tesla P4

Compute capability 6.2 (micro-architecture: Pascal; GPUs: GP10B[52])
Tegra, Jetson, DRIVE: Tegra X2, Jetson TX2, DRIVE PX 2
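
Compute capability 7.0 (micro-architecture: Volta; GPUs: GV100)
GeForce: NVIDIA TITAN V
Quadro, NVS: Quadro GV100
Tesla/Datacenter: Tesla V100, Tesla V100S

Compute capability 7.2 (micro-architecture: Volta; GPUs: GV10B[53], GV11B[54][55])
Tegra, Jetson, DRIVE: Tegra Xavier, Jetson Xavier NX, Jetson AGX Xavier, DRIVE AGX Xavier, DRIVE AGX Pegasus, Clara AGX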

Compute capability 7.5 (micro-architecture: Turing; GPUs: TU102, TU104, TU106, TU116, TU117)
GeForce: NVIDIA TITAN RTX, GeForce RTX 2080 Ti, RTX 2080 Super, RTX 2080, RTX 2070 Super, RTX 2070, RTX 2060 Super, RTX 2060 12GB, RTX 2060, GeForce GTX 1660 Ti, GTX 1660 Super, GTX 1660, GTX 1650 Super, GTX 1650, MX550, MX450
Quadro, NVS: Quadro RTX 8000, Quadro RTX 6000, Quadro RTX 5000, Quadro RTX 4000, T1000, T600, T400, T1200 (mobile), T600 (mobile), T500 (mobile), Quadro T2000 (mobile), Quadro T1000 (mobile)
Tesla/Datacenter: Tesla T4

Compute capability 8.0 (micro-architecture: Ampere; GPUs: GA100)
Tesla/Datacenter: A100 80GB, A100 40GB, A30

Compute capability 8.6 (micro-architecture: Ampere; GPUs: GA102, GA103, GA104, GA106, GA107)
GeForce: GeForce RTX 3090 Ti, RTX 3090, RTX 3080 Ti, RTX 3080 12GB, RTX 3080, RTX 3070 Ti, RTX 3070, RTX 3060 Ti, RTX 3060, RTX 3050, RTX 3050 Ti (mobile), RTX 3050 (mobile), RTX 2050 (mobile), MX570
Quadro, NVS: RTX A6000, RTX A5500, RTX A5000, RTX A4500, RTX A4000, RTX A2000, RTX A5000 (mobile), RTX A4000 (mobile), RTX A3000 (mobile), RTX A2000 (mobile)
Tesla/Datacenter: A40, A16, A10, A2

Compute capability 8.7 (micro-architecture: Ampere; GPUs: GA10B)
Tegra, Jetson, DRIVE: Jetson Orin Nano, Jetson Orin NX, Jetson AGX Orin, DRIVE AGX Orin, DRIVE AGX Pegasus OA, Clara Holoscan

Compute capability 8.9 (micro-architecture: Ada Lovelace[56]; GPUs: AD102, AD103, AD104, AD106, AD107)
GeForce: GeForce RTX 4090, RTX 4080 Super, RTX 4080, RTX 4070 Ti Super, RTX 4070 Ti, RTX 4070 Super, RTX 4070, RTX 4060 Ti, RTX 4060, RTX 4050 (mobile)
Quadro, NVS: RTX 6000 Ada, RTX 5880 Ada, RTX 5000 Ada, RTX 4500 Ada, RTX 4000 Ada, RTX 4000 SFF, RTX 3500 Ada (mobile)
Tesla/Datacenter: L40S, L40, L20, L4, L2
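
Compute capability 9.0 (micro-architecture: Hopper; GPUs: GH100)
Tesla/Datacenter: H200, H100

Compute capability 10.0 (micro-architecture: Blackwell; GPUs: GB100)
Tesla/Datacenter: B200, B100, GB200 (?)

Compute capability 10.1 (micro-architecture: Blackwell; GPUs: G10 (?))
Tesla/Datacenter: GB10 (?)

Compute capability 12.0 (micro-architecture: Blackwell; GPUs: GB202, GB203, GB205, GB206, GB207)
GeForce: GeForce RTX 5090, RTX 5080, RTX 5070 Ti, RTX 5070
Tesla/Datacenter: B40

Compute capability 12.x (?) (micro-architecture: Blackwell)
Tegra, Jetson, DRIVE: Jetson Thor (?), AGX Thor (?), Drive Thor (?)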

* – OEM-only products

Version features and specifications

Compute capability column groups: 1.0, 1.1 | 1.2, 1.3 | 2.x | 3.0 | 3.2 | 3.5, 3.7, 5.x, 6.x, 7.0, 7.2 | 7.5 | 8.x | 9.0, 10.x, 12.0

Feature support (unlisted features are supported for all compute capabilities) | Supported from compute capability
Warp vote functions (__all(), __any()) | 1.2
Warp vote functions (__ballot()) | 2.x
Memory fence functions (__threadfence_system()) | 2.x
Synchronization functions (__syncthreads_count(), __syncthreads_and(), __syncthreads_or()) | 2.x
Surface functions | 2.x
3D grid of thread blocks | 2.x
Warp shuffle functions | 3.0
Unified memory programming | 3.0
Funnel shift | 3.2
Dynamic parallelism | 3.5
Uniform Datapath[57] | 7.5
Hardware-accelerated async-copy | 8.x
Hardware-accelerated split arrive/wait barrier | 8.x
Warp-level support for reduction ops | 8.x
L2 cache residency management | 8.x
DPX instructions for accelerated dynamic programming | 9.0
Distributed shared memory | 9.0
Thread block cluster | 9.0
Tensor memory accelerator (TMA) unit | 9.0
[58]

Data types

Floating-point types

Data type | Supported vector types | Storage length in bits (complete vector) | Used length in bits (single value) | Sign bits | Exponent bits | Mantissa bits | Comments
E2M1 = FP4 | e2m1x2 / e2m1x4 | 8 / 16 | 4 | 1 | 2 | 1 |
E2M3 = FP6 variant | e2m3x2 / e2m3x4 | 16 / 32 | 6 | 1 | 2 | 3 |
E3M2 = FP6 variant | e3m2x2 / e3m2x4 | 16 / 32 | 6 | 1 | 3 | 2 |
UE4M3 | ue4m3 | 8 | 7 | 0 | 4 | 3 | Used for scaling (E2M1 only)
E4M3 = FP8 variant | e4m3 / e4m3x2 / e4m3x4 | 8 / 16 / 32 | 8 | 1 | 4 | 3 |
E5M2 = FP8 variant | e5m2 / e5m2x2 / e5m2x4 | 8 / 16 / 32 | 8 | 1 | 5 | 2 | Exponent/range of FP16, fits into 8 bits
UE8M0 | ue8m0x2 | 16 | 8 | 0 | 8 | 0 | Used for scaling (any FP4 or FP6 or FP8 format)
FP16 | f16 / f16x2 | 16 / 32 | 16 | 1 | 5 | 10 |
BF16 | bf16 / bf16x2 | 16 / 32 | 16 | 1 | 8 | 7 | Exponent/range of FP32, fits into 16 bits
TF32 | tf32 | 32 | 19 | 1 | 8 | 10 | Exponent/range of FP32, mantissa/precision of FP16
FP32 | f32 / f32x2 | 32 / 64 | 32 | 1 | 8 | 23 |
FP64 | f64 | 64 | 64 | 1 | 11 | 52 |
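
The bit counts in the floating-point table can be cross-checked with a few lines of plain Python. This sketch copies the sign/exponent/mantissa widths for some of the formats listed above and derives two quantities: the used length (the sum of the three fields) and the gap between 1.0 and the next representable value, which for a format with an implicit leading bit is 2 to the power of minus the mantissa width. The helper names are illustrative.

```python
# name -> (sign bits, exponent bits, mantissa bits), copied from the table above
FORMATS = {
    "E2M1": (1, 2, 1),
    "E4M3": (1, 4, 3),
    "E5M2": (1, 5, 2),
    "FP16": (1, 5, 10),
    "BF16": (1, 8, 7),
    "TF32": (1, 8, 10),
    "FP32": (1, 8, 23),
    "FP64": (1, 11, 52),
}


def used_bits(name):
    """Bits that carry information: sign + exponent + mantissa."""
    return sum(FORMATS[name])


def relative_step(name):
    """Gap between 1.0 and the next larger representable value."""
    return 2.0 ** -FORMATS[name][2]


# TF32 is stored in 32 bits but only uses 19 of them, and it has the
# same mantissa width (hence the same relative step) as FP16.
print(used_bits("TF32"), relative_step("TF32") == relative_step("FP16"))
```

This also makes the comments column easy to verify: BF16 and TF32 share FP32's 8 exponent bits, while E5M2 shares FP16's 5.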

Version support

Data type | Basic operations | Supported since | Atomic operations | Supported since for global memory | Supported since for shared memory
8-bit integer signed/unsigned | loading, storing, conversion | 1.0 | n/a | n/a | n/a
16-bit integer signed/unsigned | general operations | 1.0 | atomicCAS() | 3.5 | 3.5
32-bit integer signed/unsigned | general operations | 1.0 | atomic functions | 1.1 | 1.2
64-bit integer signed/unsigned | general operations | 1.0 | atomic functions | 1.2 | 2.0
any 128-bit trivially copyable type | general operations | No | atomicExch, atomicCAS | 9.0 | 9.0
16-bit floating point FP16 | addition, subtraction, multiplication, comparison, warp shuffle functions, conversion | 5.3 | half2 atomic addition | 6.0 | 6.0
16-bit floating point FP16 (continued) | | | atomic addition | 7.0 | 7.0
16-bit floating point BF16 | addition, subtraction, multiplication, comparison, warp shuffle functions, conversion | 8.0 | atomic addition | 8.0 | 8.0
32-bit floating point | general operations | 1.0 | atomicExch() | 1.1 | 1.2
32-bit floating point (continued) | | | atomic addition | 2.0 | 2.0
32-bit floating point float2 and float4 | general operations | No | atomic addition | 9.0 | 9.0
64-bit floating point | general operations | 1.3 | atomic addition | 6.0 | 6.0
Note: Any missing lines or empty entries do reflect some lack of information on that exact item.[59]

Tensor cores

FMA per
cycle per 7.5 7.5 8.6 8.6 8.9 8.9
tensor Supported since 7.0 7.2 8.0 8.7 9.0
Workstation Desktop Workstation Desktop Desktop Workstation
core[60]

1st 1st
For dense For sparse
Data Type Gen Gen? 2nd Gen (8x/SM) 3rd Gen (4x/SM) 4th Gen (4x/SM)
matrices matrices
(8x/SM) (8x/SM)
1-bit values 8.0 as
No
(AND) experimental
No 4096 2048
1-bit values
1024
(XOR) 7.5–8.9 as
No Deprec
4-bit experimental 8.0–8.9 as
256 1024 512
integers experimental
4-bit
floating
10.0 No 4
point FP4
(E2M1)
6-bit
floating
point FP6 10.0 No 2
(E3M2 and
E2M3)
8-bit
7.2 8.0 No 128 128 512 256
integers
8-bit
floating
point FP8
(E4M3 and 256
E5M2) with
FP16
1024 2
accumulate
8.9 No
8-bit
floating
point FP8
(E4M3 and 128
E5M2) with
FP32
accumulate
16-bit
floating
point FP16 64 128
with FP16
accumulate
7.0 8.0 64 64
16-bit
floating
point FP16 256 512 1
with FP32
accumulate
64 128
16-bit
floating
32
point BF16 64[62]
with FP32
accumulate
7.5[61] 8.0
32-bit (19
No speed tbd
bits used)
128 32 64 256
floating (32?)[62]
point TF32
64-bit
floating 8.0 No No 16 speed tbd 32
point

Note: Any missing lines or empty entries do reflect some lack of information on that exact item.[63][64] [65] [66] [67] [68]

Tensor Core Composition 7.0 7.2, 7.5 8.0, 8.6 8.7 8.9 9.0
[69][70][71][72] 4 (8) 8 (16) 4 (8) 16 (32)
Dot Product Unit Width in FP16 units (in bytes)

Dot Product Units per Tensor Core 16 32


Tensor Cores per SM partition 2 1
[73] [74] 256 512 256 1024
Full throughput (Bytes/cycle) per SM partition

FP Tensor Cores: Minimum cycles for warp-wide matrix calculation 8 4 8

FP Tensor Cores: Minimum Matrix Shape for full throughput (Bytes)[75] 2048

INT Tensor Cores: Minimum cycles for warp-wide matrix calculation No 4


INT Tensor Cores: Minimum Matrix Shape for full throughput (Bytes) No 1024 2048 1024
[76][77][78][79]
FP64 Tensor Core Composition 8.0 8.6 8.7 8.9 9.0
Dot Product Unit Width in FP64 units (in bytes) 4 (32) tbd 4 (32)
Dot Product Units per Tensor Core 4 tbd 8
Tensor Cores per SM partition 1

Full throughput (Bytes/cycle)[73] per SM partition[74] 128 tbd 256

Minimum cycles for warp-wide matrix calculation 16 tbd


[75] 2048
Minimum Matrix Shape for full throughput (Bytes)

Technical specification

Technical Compute capability (version)


specifications 1.0 1.1 1.2 1.3 2.x 3.0 3.2 3.5 3.7 5.0 5.2 5.3 6.0 6.1 6.2 7.0 7.2 7.5
Maximum
number of
resident grids
per device
(concurrent
1 16 4 32 16 128 32 16 128 16
kernel
execution, can
be lower for
specific
devices)
Maximum
dimensionality
2 3
of grid of
thread blocks
Maximum x-
dimension of a
grid of thread
65535 2^31 − 1
blocks
Maximum y-,
or z-dimension
65535
of a grid of
thread blocks
Maximum
dimensionality 3
of thread block
Maximum x- or
y-dimension of 512 1024
a block
Maximum z-
dimension of a 64
block
Maximum
number of
512 1024
threads per
block
Warp size 32
Maximum
number of
resident blocks 8 16 32 16
per
multiprocessor
Maximum
number of
resident warps 24 32 48 64 32
per
multiprocessor
Maximum
number of
resident 768 1024 1536 2048 1024
threads per
multiprocessor
Number of 32-
bit regular 128
8K 16 K 32 K 64 K 64 K
registers per K
multiprocessor
Number of 32-bit uniform registers per multiprocessor: No (before 7.5); 2 K (7.5)[80][81]
Maximum
number of 32-
64 32
bit registers 8K 16 K 32 K 64 K 32 K 64 K 32 K
K K
per thread
block
Maximum 124 63 255
number of 32-
bit regular
registers per
thread
Maximum number of 32-bit uniform registers per warp: No (before 7.5); 63 (7.5)[80][82]
Amount of
shared 80 /
memory per 96 /
16 / 0 / 8 / 16 /
multiprocessor 112
48 KiB 16 / 32 / 48 32 / 64 /
(out of overall 16 KiB KiB 64 KiB 96 KiB 64 KiB 96 KiB 64 KiB 32 / 64 KiB (o
(of 64 KiB (of 64 KiB) 96 KiB (of
shared (of
KiB) 128 KiB)
memory + L1 128
cache, where KiB)
applicable)
Maximum
amount of
96 48
shared 16 KiB 48 KiB 64 KiB
KiB KiB
memory per
thread block
Number of
shared 16 32
memory banks
Amount of
local memory 16 KiB 512 KiB
per thread
Constant
memory size
accessible by
CUDA C/C++
(1 bank, PTX 64 KiB
can access 11
banks, SASS
can access 18
banks)
Cache working
set per
multiprocessor 8 KiB 4 KiB
for constant
memory
Cache working
set per 24 KiB
16 KiB per 32 –
multiprocessor
TPC
per 12 KiB 12 – 48 KiB[83] 24 KiB 48 KiB 32 KiB[84] 24 KiB 48 KiB 24 KiB
128 KiB
32 – 64
for texture TPC
memory
Maximum
width for 1D
texture
reference 8192 65536
bound to a
CUDA
array
Maximum
width for 1D
texture
reference 2^27 2^28 2^27 2^28 2^27
bound to linear
memory
Maximum
width and
number of
layers for a 1D 8192 × 512 16384 × 2048
layered
texture
reference
Maximum
width and
height for 2D
texture
65536 × 32768 65536 × 65535
reference
bound
to a CUDA
array
Maximum
width and
height for 2D
texture
65000 x 65000 65536 x 65536
reference
bound
to a linear
memory
Maximum — 16384 x 16384
width and
height for 2D
texture
reference
bound
to a CUDA
array
supporting
texture gather
Maximum
width, height,
and number of
8192 × 8192 × 512 16384 × 16384 × 2048 3
layers for a 2D
layered texture
reference
Maximum
width, height
and depth for
a 3D texture
reference 2048^3 4096^3
bound to linear
memory or a
CUDA array
Maximum
width (and
height) for a
— 16384
cubemap
texture
reference
Maximum
width (and
height) and
number of
— 16384 × 2046
layers
for a cubemap
layered texture
reference
Maximum
number of
textures that
128 256
can be bound
to a
kernel
Maximum width for a 1D surface reference bound to a CUDA array: Not supported, 65536, 16384
Maximum width and number of layers for a 1D layered surface reference: 65536 × 2048, 16384 × 2048
Maximum width and height for a 2D surface reference bound to a CUDA array: 65536 × 32768, 16384 × 65536
Maximum width, height, and number of layers for a 2D layered surface reference: 65536 × 32768 × 2048, 16384 × 16384 × 2048, 3
Maximum width, height, and depth for a 3D surface reference bound to a CUDA array: 65536 × 32768 × 2048, 4096 × 4096 × 4096, 16
Maximum width (and height) for a cubemap surface reference bound to a CUDA array: 32768, 16384
Maximum width and number of layers for a cubemap layered surface reference: 32768 × 2046, 16384 × 2046
Maximum number of surfaces that can be bound to a kernel: 8, 16
Maximum number of instructions per kernel: 2 million, 512 million
Maximum number of Thread Blocks per Thread Block Cluster[85]: No

Technical 1.0 1.1 1.2 1.3 2.x 3.0 3.2 3.5 3.7 5.0 5.2 5.3 6.0 6.1 6.2 7.0 7.2 7.5
specifications Compute capability (version)

[86]
[87]
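
The "Number of shared memory banks" row (16 or 32) matters mostly because of bank conflicts. The following is a minimal sketch of how one might estimate the conflict degree of an access pattern. It assumes that successive 4-byte words map to successive banks; the bank width, the function names, and the example strides are illustrative assumptions, not values taken from the table.

```python
def bank_of(byte_address, num_banks=32, bank_width_bytes=4):
    """Bank holding the given shared-memory byte address.

    Assumes word-interleaved banks of `bank_width_bytes` each (an assumption).
    """
    return (byte_address // bank_width_bytes) % num_banks


def conflict_degree(byte_addresses, num_banks=32, bank_width_bytes=4):
    """Largest number of distinct words requested from any single bank.

    Threads reading the same word count once, modelling a broadcast.
    """
    words_per_bank = {}
    for addr in byte_addresses:
        word = addr // bank_width_bytes
        words_per_bank.setdefault(word % num_banks, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


# One group of 32 threads reading 4-byte values with different strides.
unit_stride = [4 * t for t in range(32)]      # thread t reads word t
stride_two = [4 * 2 * t for t in range(32)]   # thread t reads word 2t
stride_32 = [4 * 32 * t for t in range(32)]   # thread t reads word 32t

print(conflict_degree(unit_stride))  # 1  (conflict-free)
print(conflict_degree(stride_two))   # 2  (two-way conflict)
print(conflict_degree(stride_32))    # 32 (fully serialized)
```

A degree of 1 means every bank serves at most one word; higher values indicate how many serialized passes the access would need under this simplified model. Passing `num_banks=16` models the 16-bank column of the table.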

Multiprocessor architecture

Architecture Compute capability (version


specifications 1.0 1.1 1.2 1.3 2.0 2.1 3.0 3.2 3.5 3.7 5.0 5.2 5.3 6.0
Number of
ALU lanes for
INT32
arithmetic
operations
Number of
ALU lanes for
any INT32 or
FP32
8 32 48 192[88] 128
arithmetic
operation 128 64
Number of
ALU lanes for
FP32
arithmetic
operations
Number of
ALU lanes for
FP16x2 No
arithmetic
operations
Number of
ALU lanes for 16 by 4 by 8/
FP64 No 1 8 64 4[95] 32
arithmetic FP32[92] FP32[93] 64[94]
operations
4 8 8 per 2
Number of
per per SM / 3 8 per 3
Load/Store 16 32 16
2 2 SM
Units
SM SM SM[94]

Number of special function units for single-precision floating-point transcendental functions: 2[96], 4, 8, 32, 16
Number of 4 8 8 per 2
texture per per 8 per 3
mapping units 2 2
/
SM
4 4 / 8[94] 16 8 16 8
(TMU) SM SM 3SM[94]

Number of ALU lanes for uniform INT32 arithmetic operations: No
Number of tensor cores: No
Number of raytracing cores: No
Number of SM Partitions = Processing Blocks[99]: 1, 4, 2
Number of warp schedulers per SM partition: 1, 2, 4
Max number of new instructions issued each cycle by a single scheduler[100]: 2[101], 1, 2[102], 2
Size of unified
memory for 64 KiB SM + 96 KiB SM + 64 KiB SM + 64 KiB SM +
16 128
data cache 16 KiB[103] 64 KiB 24 KiB L1 24 KiB L1 24 KiB L1 24 KiB L1
KiB[103] KiB
and shared (separate)[104] (separate)[104] (separate)[104] (separate)[104]
memory
Size of L3
instruction 32
cache per KiB[106]
GPU
Size of L2 use L
instruction
cache per
8 KiB
Texture
Processor
Cluster (TPC)
Size of L1.5
instruction 32
cache per KiB
32 KiB 48 KiB[84] 128 KiB
SM[107] 4 KiB
Size of L1
instruction 8 KiB 8K
cache per SM
Size of L0
instruction
only 1 partition per SM No
cache per SM
partition

Instruction 64 bits instructions + 64


32 bits instructions and 64 bits instructions[111] bits control logic every 64 bits instructions + 64 bits control logic every 3
Width[107] 7 instructions
Memory Bus
Width per
64 ((G)DDR) 32 ((G)DDR) 512 (HBM)
Memory
Partition in bits
L2 Cache per 32
Memory 16 KiB[112] 128 KiB 256 KiB 1 MiB 512 KiB 128 KiB 512 KiB
Partition KiB[112]

Number of
Render Output
Units (ROP)
per memory 4 8 4 8 16 8 12
partition (or
per GPC in
later models)

Architecture 1.0 1.1 1.2 1.3 2.0 2.1 3.0 3.2 3.5 3.7 5.0 5.2 5.3 6.0
specifications Compute capability (version

[115]
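
The instruction-width row of the table above lists 64-bit instructions accompanied by one 64-bit control word for every 7 instructions (and, in a later column that is cut off in this copy, apparently every 3). As a rough worked example, the sketch below computes the average storage cost per instruction under that scheme. The function names are hypothetical, and the 8 KiB cache size is only an illustrative input.

```python
from fractions import Fraction


def bits_per_instruction(instr_bits, control_bits, group_size):
    """Average stored bits per instruction when one control word
    accompanies every `group_size` instructions."""
    return Fraction(instr_bits * group_size + control_bits, group_size)


def instructions_in_cache(cache_bytes, instr_bits, control_bits, group_size):
    """Instructions held by the whole groups that fit in `cache_bytes`."""
    group_bits = instr_bits * group_size + control_bits
    return (cache_bytes * 8 // group_bits) * group_size


print(bits_per_instruction(64, 64, 7))             # 512/7 (about 73.1 bits)
print(bits_per_instruction(64, 64, 3))             # 256/3 (about 85.3 bits)
print(instructions_in_cache(8 * 1024, 64, 64, 7))  # 896
print(instructions_in_cache(8 * 1024, 64, 64, 3))  # 768
```

Under these assumptions, a control word every 3 instructions costs roughly 17% more storage per instruction than one every 7, so the same cache holds fewer instructions.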

For more information read the Nvidia CUDA programming guide.[116]
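
The scheduler rows of the architecture table (SM partitions, warp schedulers per partition, and instructions issued per scheduler per cycle) can be multiplied to give an upper bound on instruction issue per SM. The sketch below shows that arithmetic. The class name and the two example configurations are hypothetical and are not tied to any specific compute capability, and the footnotes note extra restrictions on some architectures, so the result is only a ceiling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMConfig:
    partitions: int                # SM partitions (processing blocks)
    schedulers_per_partition: int  # warp schedulers in each partition
    issue_width: int               # max new instructions per scheduler per cycle

    def peak_issue_per_cycle(self) -> int:
        """Upper bound on instructions issued per cycle by one SM."""
        return self.partitions * self.schedulers_per_partition * self.issue_width


# Hypothetical configurations, for illustration only.
examples = {
    "1 partition, 4 dual-issue schedulers": SMConfig(1, 4, 2),
    "4 partitions, 1 single-issue scheduler each": SMConfig(4, 1, 1),
}

for name, cfg in examples.items():
    print(f"{name}: up to {cfg.peak_issue_per_cycle()} instructions/cycle")
```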

Current and future usages of CUDA architecture


Accelerated rendering of 3D graphics
Accelerated interconversion of video file formats
Accelerated encryption, decryption and compression
Bioinformatics, e.g. NGS DNA sequencing with BarraCUDA[117]
Distributed calculations, such as predicting the native conformation of proteins
Medical analysis simulations, for example virtual reality based on CT and MRI scan images
Physical simulations,[118] in particular in fluid dynamics
Neural network training in machine learning problems
Large Language Model inference
Face recognition
Volunteer computing projects, such as SETI@home and other projects using BOINC software
Molecular dynamics
Mining cryptocurrencies
Structure from motion (SfM) software

Comparison with competitors


CUDA competes with other GPU computing stacks: Intel oneAPI and AMD ROCm.

Whereas Nvidia's CUDA is closed-source, Intel's oneAPI and AMD's ROCm are open source.

Intel oneAPI
oneAPI is an initiative based on open standards, created to support software development for multiple hardware architectures.[119] The oneAPI libraries must implement open specifications that are discussed publicly by the Special Interest Groups, so any developer or organization can implement its own versions of the oneAPI libraries.[120][121]

oneAPI was originally created by Intel; other hardware adopters include Fujitsu and Huawei.

Unified Acceleration Foundation (UXL)


The Unified Acceleration Foundation (UXL) is a technology consortium that continues the oneAPI initiative. It aims to create an open-standard accelerator software ecosystem, together with related open standards and specification projects, through Working Groups and Special Interest Groups (SIGs), and thereby to offer open alternatives to Nvidia's CUDA. The main companies behind it are Intel, Google, ARM, Qualcomm, Samsung, Imagination, and VMware.[122]

AMD ROCm
ROCm[123] is an open source software stack for graphics processing unit (GPU) programming from Advanced Micro Devices (AMD).

See also
SYCL – an open standard from Khronos Group for programming a variety of platforms, including GPUs, with single-source modern C++, similar to the higher-level CUDA Runtime API (which is also single-source)
BrookGPU – the Stanford University graphics group's compiler
Array programming
Parallel computing
Stream processing
rCUDA – an API for computing on remote computers
Molecular modeling on GPUs
Vulkan – low-level, high-performance 3D graphics and computing API
OptiX – ray tracing API by NVIDIA
CUDA binary (cubin) – a type of fat binary
Numerical Library Collection – by NEC for their vector processor

References
1. "NVIDIA® CUDA™ Unleashes Power of GPU Computing - Press Release" (https://web.archive.org/web/20070329144655/http://www.nvidia.com/object/IO_39918.html). nvidia.com. Archived from the original (http://www.nvidia.com/object/IO_39918.html) on 29 March 2007. Retrieved 26 January 2025.
2. Shah, Agam. "Nvidia not totally against third parties making CUDA chips" (https://www.theregister.com/2021/11/10/nvidia_cuda_silicon/). www.theregister.com. Retrieved 2024-04-25.
3. "Nvidia CUDA Home Page" (https://developer.nvidia.com/cuda-zone). 18 July 2017.
4. Shimpi, Anand Lal; Wilson, Derek (November 8, 2006). "Nvidia's GeForce 8800 (G80): GPUs Re-architected for DirectX 10" (https://www.anandtech.com/show/2116/8). AnandTech. Retrieved May 16, 2015.
5. "Introduction — nsight-visual-studio-edition 12.6 documentation" (https://docs.nvidia.com/nsight-visual-studio-edition/introduction/index.html#cuda-debugger). docs.nvidia.com. Retrieved 2024-10-10.
6. Abi-Chahla, Fedy (June 18, 2008). "Nvidia's CUDA: The End of the CPU?" (https://www.tomshardware.com/reviews/nvidia-cuda-gpu,1954.html). Tom's Hardware. Retrieved May 17, 2015.
7. Zunitch, Peter (2018-01-24). "CUDA vs. OpenCL vs. OpenGL" (https://www.videomaker.com/article/c15/19313-cuda-vs-opencl-vs-opengl). Videomaker. Retrieved 2018-09-16.
8. "OpenCL" (https://developer.nvidia.com/opencl). NVIDIA Developer. 2013-04-24. Retrieved 2019-11-04.
9. Witt, Stephen (2023-11-27). "How Jensen Huang's Nvidia Is Powering the A.I. Revolution" (https://www.newyorker.com/magazine/2023/12/04/how-jensen-huangs-nvidia-is-powering-the-ai-revolution). The New Yorker. ISSN 0028-792X (https://search.worldcat.org/issn/0028-792X). Retrieved 2023-12-10.
10. "CUDA LLVM Compiler" (https://developer.nvidia.com/cuda-llvm-compiler). 7 May 2012.
11. First OpenCL demo on a GPU (https://www.youtube.com/watch?v=r1sN1ELJfNo) on YouTube
12. DirectCompute Ocean Demo Running on Nvidia CUDA-enabled GPU (https://www.youtube.com/watch?v=K1I4kts5mqc) on YouTube
13. Vasiliadis, Giorgos; Antonatos, Spiros; Polychronakis, Michalis; Markatos, Evangelos P.; Ioannidis, Sotiris (September 2008). "Gnort: High Performance Network Intrusion Detection Using Graphics Processors" (http://www.ics.forth.gr/dcs/Activities/papers/gnort.raid08.pdf) (PDF). Recent Advances in Intrusion Detection. Lecture Notes in Computer Science. Vol. 5230. pp. 116–134. doi:10.1007/978-3-540-87403-4_7 (https://doi.org/10.1007%2F978-3-540-87403-4_7). ISBN 978-3-540-87402-7.
14. Schatz, Michael C.; Trapnell, Cole; Delcher, Arthur L.; Varshney, Amitabh (2007). "High-throughput sequence alignment using Graphics Processing Units" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2222658). BMC Bioinformatics. 8: 474. doi:10.1186/1471-2105-8-474 (https://doi.org/10.1186%2F1471-2105-8-474). PMC 2222658 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2222658). PMID 18070356 (https://pubmed.ncbi.nlm.nih.gov/18070356).
15. Manavski, Svetlin A.; Giorgio, Valle (2008). "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2323659). BMC Bioinformatics. 10 (Suppl 2): S10. doi:10.1186/1471-2105-9-S2-S10 (https://doi.org/10.1186%2F1471-2105-9-S2-S10). PMC 2323659 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2323659). PMID 18387198 (https://pubmed.ncbi.nlm.nih.gov/18387198).
16. "Pyrit – Google Code" (https://code.google.com/p/pyrit/).
17. "Use your Nvidia GPU for scientific computing" (https://web.archive.org/web/20081228022142/http://boinc.berkeley.edu/cuda.php). BOINC. 2008-12-18. Archived from the original (http://boinc.berkeley.edu/cuda.php) on 2008-12-28. Retrieved 2017-08-08.
18. "Nvidia CUDA Software Development Kit (CUDA SDK) – Release Notes Version 2.0 for MAC OS X" (https://web.archive.org/web/20090106020401/http://developer.download.nvidia.com/compute/cuda/sdk/website/doc/CUDA_SDK_release_notes_macosx.txt). Archived from the original (http://developer.download.nvidia.com/compute/cuda/sdk/website/doc/CUDA_SDK_release_notes_macosx.txt) on 2009-01-06.
19. "CUDA 1.1 – Now on Mac OS X" (https://web.archive.org/web/20081122105633/http://news.developer.nvidia.com/2008/02/cuda-11---now-o.html). February 14, 2008. Archived from the original (http://news.developer.nvidia.com/2008/02/cuda-11---now-o.html) on November 22, 2008.
20. "CUDA 11 Features Revealed" (https://developer.nvidia.com/blog/cuda-11-features-revealed/). 14 May 2020.
21. "CUDA Toolkit 11.1 Introduces Support for GeForce RTX 30 Series and Quadro RTX Series GPUs" (https://developer.nvidia.com/blog/cuda-11-1-introduces-support-rtx-30-series/). 23 September 2020.
22. "Enhancing Memory Allocation with New NVIDIA CUDA 11.2 Features" (https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/). 16 December 2020.
23. "Exploring the New Features of CUDA 11.3" (https://developer.nvidia.com/blog/exploring-the-new-features-of-cuda-11-3/). 16 April 2021.
24. Silberstein, Mark; Schuster, Assaf; Geiger, Dan; Patney, Anjul; Owens, John D. (2008). "Efficient computation of sum-products on GPUs through software-managed cache" (https://escholarship.org/content/qt8js4v3f7/qt8js4v3f7.pdf?t=ptt3te) (PDF). Proceedings of the 22nd annual international conference on Supercomputing – ICS '08. pp. 309–318. doi:10.1145/1375527.1375572 (https://doi.org/10.1145%2F1375527.1375572). ISBN 978-1-60558-158-3.
25. "CUDA C Programming Guide v8.0" (http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf) (PDF). nVidia Developer Zone. January 2017. p. 19. Retrieved 22 March 2017.
26. "NVCC forces c++ compilation of .cu files" (https://devtalk.nvidia.com/default/topic/508479/cuda-programming-and-performance/nvcc-forces-c-compilation-of-cu-files/#entry1340190). 29 November 2011.
27. Whitehead, Nathan; Fit-Florea, Alex. "Precision & Performance: Floating Point and IEEE 754 Compliance for Nvidia GPUs" (https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf) (PDF). Nvidia. Retrieved November 18, 2014.
28. "CUDA-Enabled Products" (https://www.nvidia.com/object/cuda_learn_products.html). CUDA Zone. Nvidia Corporation. Retrieved 2008-11-03.
29. "Coriander Project: Compile CUDA Codes To OpenCL, Run Everywhere" (http://www.phoronix.com/scan.php?page=news_item&px=CUDA-On-CL-Coriander). Phoronix.
30. Perkins, Hugh (2017). "cuda-on-cl" (http://www.iwocl.org/wp-content/uploads/iwocl2017-hugh-perkins-cuda-cl.pdf) (PDF). IWOCL. Retrieved August 8, 2017.
31. "hughperkins/coriander: Build NVIDIA® CUDA™ code for OpenCL™ 1.2 devices" (https://github.com/hughperkins/coriander). GitHub. May 6, 2019.
32. "CU2CL Documentation" (http://chrec.cs.vt.edu/cu2cl/documentation.php). chrec.cs.vt.edu.
33. "GitHub – vosen/ZLUDA" (https://github.com/vosen/ZLUDA). GitHub.
34. Larabel, Michael (2024-02-12), "AMD Quietly Funded A Drop-In CUDA Implementation Built On ROCm: It's Now Open-Source" (https://www.phoronix.com/review/radeon-cuda-zluda), Phoronix, retrieved 2024-02-12
35. "GitHub – chip-spv/chipStar" (https://github.com/chip-spv/chipStar). GitHub.
36. "PyCUDA" (http://mathema.tician.de/software/pycuda).
37. "pycublas" (https://web.archive.org/web/20090420124748/http://kered.org/blog/2009-04-13/easy-python-numpy-cuda-cublas/). Archived from the original (http://kered.org/blog/2009-04-13/easy-python-numpy-cuda-cublas/) on 2009-04-20. Retrieved 2017-08-08.
38. "CuPy" (https://cupy.dev/). Retrieved 2020-01-08.
39. "NVIDIA CUDA Programming Guide. Version 1.0" (http://developer.download.nvidia.com/compute/cuda/1.0/NVIDIA_CUDA_Programming_Guide_1.0.pdf) (PDF). June 23, 2007.
40. "NVIDIA CUDA Programming Guide. Version 2.1" (http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf) (PDF). December 8, 2008.
41. "NVIDIA CUDA Programming Guide. Version 2.2" (http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf) (PDF). April 2, 2009.
42. "NVIDIA CUDA Programming Guide. Version 2.2.1" (http://developer.download.nvidia.com/compute/cuda/2_21/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.1.pdf) (PDF). May 26, 2009.
43. "NVIDIA CUDA Programming Guide. Version 2.3.1" (http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf) (PDF). August 26, 2009.
44. "NVIDIA CUDA Programming Guide. Version 3.0" (http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_ProgrammingGuide.pdf) (PDF). February 20, 2010.
45. "NVIDIA CUDA C Programming Guide. Version 3.1.1" (http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf) (PDF). July 21, 2010.
46. "NVIDIA CUDA C Programming Guide. Version 3.2" (http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf) (PDF). November 9, 2010.
47. "CUDA 11.0 Release Notes" (https://docs.nvidia.com/cuda/archive/11.0/cuda-toolkit-release-notes/index.html). NVIDIA Developer.
48. "CUDA 11.1 Release Notes" (https://docs.nvidia.com/cuda/archive/11.1.0/cuda-toolkit-release-notes/index.html). NVIDIA Developer.
49. "CUDA 11.5 Release Notes" (https://docs.nvidia.com/cuda/archive/11.5.0/cuda-toolkit-release-notes/index.html). NVIDIA Developer.
50. "CUDA 11.8 Release Notes" (https://docs.nvidia.com/cuda/archive/11.8.0/cuda-toolkit-release-notes/index.html). NVIDIA Developer.
51. "NVIDIA Quadro NVS 420 Specs" (https://www.techpowerup.com/gpu-specs/quadro-nvs-420.c1448). TechPowerUp GPU Database. 25 August 2023.
52. Larabel, Michael (March 29, 2017). "NVIDIA Rolls Out Tegra X2 GPU Support In Nouveau" (http://www.phoronix.com/scan.php?page=news_item&px=Tegra-X2-Nouveau-Support). Phoronix. Retrieved August 8, 2017.
53. Nvidia Xavier Specs (https://www.techpowerup.com/gpudb/3232/xavier) on TechPowerUp (preliminary)
54. "Welcome — Jetson Linux Developer Guide 34.1 documentation" (https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/power_management_jetson_xavier.html).
55. "NVIDIA Bringing up Open-Source Volta GPU Support for Their Xavier SoC" (https://www.phoronix.com/scan.php?page=news_item&px=NVIDIA-Nouveau-GV11B-Volta-Xav).
56. "NVIDIA Ada Lovelace Architecture" (https://www.nvidia.com/en-us/geforce/ada-lovelace-architecture/).
57. Dissecting the Turing GPU Architecture through Microbenchmarking (https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9839-discovering-the-turing-t4-gpu-architecture-with-microbenchmarks.pdf)
58. "H.1. Features and Technical Specifications – Table 13. Feature Support per Compute Capability" (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications). docs.nvidia.com. Retrieved 2020-09-23.
59. "CUDA C++ Programming Guide" (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications).
60. Fused-Multiply-Add, actually executed, Dense Matrix
61. as SASS since 7.5, as PTX since 8.0
62. unofficial support in SASS
63. "Technical brief. NVIDIA Jetson AGX Orin Series" (https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf) (PDF). nvidia.com. Retrieved 5 September 2023.
64. "NVIDIA Ampere GA102 GPU Architecture" (https://images.nvidia.com/aem-dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf) (PDF). nvidia.com. Retrieved 5 September 2023.
65. Luo, Weile; Fan, Ruibo; Li, Zeyu; Du, Dayou; Wang, Qiang; Chu, Xiaowen (2024). "Benchmarking and Dissecting the Nvidia Hopper GPU Architecture". arXiv:2402.13499v1 (https://arxiv.org/abs/2402.13499v1) [cs.AR (https://arxiv.org/archive/cs.AR)].
66. "Datasheet NVIDIA A40" (https://images.nvidia.com/content/Solutions/data-center/a40/nvidia-a40-datasheet.pdf) (PDF). nvidia.com. Retrieved 27 April 2024.
67. "NVIDIA AMPERE GA102 GPU ARCHITECTURE" (https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.1.pdf) (PDF). 27 April 2024.
68. "Datasheet NVIDIA L40" (https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/support-guide/NVIDIA-L40-Datasheet-January-2023.pdf) (PDF). 27 April 2024.
69. In the Whitepapers the Tensor Core cube diagrams represent the Dot Product Unit Width into the height (4 FP16 for Volta and Turing, 8 FP16 for A100, 4 FP16 for GA102, 16 FP16 for GH100). The other two dimensions represent the number of Dot Product Units (4x4 = 16 for Volta and Turing, 8x4 = 32 for Ampere and Hopper). The resulting gray blocks are the FP16 FMA operations per cycle. Pascal without Tensor core is only shown for speed comparison as is Volta V100 with non-FP16 datatypes.
70. "NVIDIA Turing Architecture Whitepaper" (https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf) (PDF). nvidia.com. Retrieved 5 September 2023.
71. "NVIDIA Tensor Core GPU" (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf) (PDF). nvidia.com. Retrieved 5 September 2023.
72. "NVIDIA Hopper Architecture In-Depth" (https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/). 22 March 2022.
73. shape x converted operand size, e.g. 2 tensor cores x 4x4x4xFP16/cycle = 256 Bytes/cycle
74. = product first 3 table rows
75. = product of previous 2 table rows; shape: e.g. 8x8x4xFP16 = 512 Bytes
76. Sun, Wei; Li, Ang; Geng, Tong; Stuijk, Sander; Corporaal, Henk (2023). "Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors". IEEE Transactions on Parallel and Distributed Systems. 34 (1): 246–261. arXiv:2206.02874 (https://arxiv.org/abs/2206.02874). doi:10.1109/tpds.2022.3217824 (https://doi.org/10.1109%2Ftpds.2022.3217824). S2CID 249431357 (https://api.semanticscholar.org/CorpusID:249431357).
77. "Parallel Thread Execution ISA Version 7.7" (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma).
78. Raihan, Md Aamir; Goli, Negar; Aamodt, Tor (2018). "Modeling Deep Learning Accelerator Enabled GPUs". arXiv:1811.08309 (https://arxiv.org/abs/1811.08309) [cs.MS (https://arxiv.org/archive/cs.MS)].
79. "NVIDIA Ada Lovelace Architecture" (https://www.nvidia.com/en-gb/geforce/ada-lovelace-architecture).
80. Jia, Zhe; Maggioni, Marco; Smith, Jeffrey; Daniele Paolo Scarpazza (2019). "Dissecting the NVidia Turing T4 GPU via Microbenchmarking". arXiv:1903.07486 (https://arxiv.org/abs/1903.07486) [cs.DC (https://arxiv.org/archive/cs.DC)].
81. Burgess, John (2019). "RTX ON – The NVIDIA TURING GPU" (https://ieeexplore.ieee.org/document/8875651). 2019 IEEE Hot Chips 31 Symposium (HCS). pp. 1–27. doi:10.1109/HOTCHIPS.2019.8875651 (https://doi.org/10.1109%2FHOTCHIPS.2019.8875651). ISBN 978-1-7281-2089-8. S2CID 204822166 (https://api.semanticscholar.org/CorpusID:204822166).
82. Burgess, John (2019). "RTX ON – The NVIDIA TURING GPU" (https://ieeexplore.ieee.org/document/8875651). 2019 IEEE Hot Chips 31 Symposium (HCS). pp. 1–27. doi:10.1109/HOTCHIPS.2019.8875651 (https://doi.org/10.1109%2FHOTCHIPS.2019.8875651). ISBN 978-1-7281-2089-8. S2CID 204822166 (https://api.semanticscholar.org/CorpusID:204822166).
83. dependent on device
84. "Tegra X1" (https://developer.nvidia.com/content/tegra-x1). 9 January 2015.
85. NVIDIA H100 Tensor Core GPU Architecture (https://nvdam.widen.net/s/5bx55xfnxf/gtc22-whitepaper-hopper)
86. H.1. Features and Technical Specifications – Table 14. Technical Specifications per Compute Capability (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications)
87. NVIDIA Hopper Architecture In-Depth (https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth)
88. can only execute 160 integer instructions according to programming guide
89. 128 according to [1] (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions). 64 from FP32 + 64 separate units?
90. 64 by FP32 cores and 64 by flexible FP32/INT cores.
91. "CUDA C++ Programming Guide" (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions).
92. 32 FP32 lanes combine to 16 FP64 lanes. Maybe lower depending on model.
93. only supported by 16 FP32 lanes, they combine to 4 FP64 lanes
94. depending on model
95. Effective speed, probably over FP32 ports. No description of actual FP64 cores.
96. Can also be used for integer additions and comparisons
97. 2 clock cycles/instruction for each SM partition. Burgess, John (2019). "RTX ON – The NVIDIA TURING GPU" (https://ieeexplore.ieee.org/document/8875651). 2019 IEEE Hot Chips 31 Symposium (HCS). pp. 1–27. doi:10.1109/HOTCHIPS.2019.8875651 (https://doi.org/10.1109%2FHOTCHIPS.2019.8875651). ISBN 978-1-7281-2089-8. S2CID 204822166 (https://api.semanticscholar.org/CorpusID:204822166).
98. Durant, Luke; Giroux, Olivier; Harris, Mark; Stam, Nick (May 10, 2017). "Inside Volta: The World's Most Advanced Data Center GPU" (https://devblogs.nvidia.com/inside-volta/). Nvidia developer blog.
99. The schedulers and dispatchers have dedicated execution units unlike with Fermi and Kepler.
100. Dispatching can overlap concurrently, if it takes more than one cycle (when there are less execution units than 32/SM Partition)
101. Can dual issue MAD pipe and SFU pipe
102. No more than one scheduler can issue 2 instructions at once. The first scheduler is in charge of warps with odd IDs. The second scheduler is in charge of warps with even IDs.
103. shared memory only, no data cache
104. shared memory separate, but L1 includes texture cache
105. "H.6.1. Architecture" (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#architecture-7-x). docs.nvidia.com. Retrieved 2019-05-13.
106. "Demystifying GPU Microarchitecture through Microbenchmarking" (https://www.stuffedcow.net/files/gpuarch-ispass2010.pdf) (PDF).
107. Jia, Zhe; Maggioni, Marco; Staiger, Benjamin; Scarpazza, Daniele P. (2018). "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking". arXiv:1804.06826 (https://arxiv.org/abs/1804.06826) [cs.DC (https://arxiv.org/archive/cs.DC)].
108. Jia, Zhe; Maggioni, Marco; Smith, Jeffrey; Daniele Paolo Scarpazza (2019). "Dissecting the NVidia Turing T4 GPU via Microbenchmarking". arXiv:1903.07486 (https://arxiv.org/abs/1903.07486) [cs.DC (https://arxiv.org/archive/cs.DC)].
109. "Dissecting the Ampere GPU Architecture through Microbenchmarking" (https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s33322/).
110. Note that Jia, Zhe; Maggioni, Marco; Smith, Jeffrey; Daniele Paolo Scarpazza (2019). "Dissecting the NVidia Turing T4 GPU via Microbenchmarking". arXiv:1903.07486 (https://arxiv.org/abs/1903.07486) [cs.DC (https://arxiv.org/archive/cs.DC)]. disagrees and states 2 KiB L0 instruction cache per SM partition and 16 KiB L1 instruction cache per SM
111. "asfermi Opcode" (https://github.com/hyqneuron/asfermi/wiki/Opcode). GitHub.
112. for access with texture engine only
113. 25% disabled on RTX 4060, RTX 4070, RTX 4070 Ti and RTX 4090
114. 25% disabled on RTX 5070 Ti and RTX 5090
115. "I.7. Compute Capability 8.x" (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-8-x). docs.nvidia.com. Retrieved 2022-10-12.
116. "Appendix F. Features and Technical Specifications" (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf) (PDF). (3.2 MiB), page 148 of 175 (Version 5.0 October 2012).
117. "nVidia CUDA Bioinformatics: BarraCUDA" (https://www.biocentric.nl/biocentric/nvidia-cuda-bioinformatics-barracuda/). BioCentric. 2019-07-19. Retrieved 2019-10-15.
118. "Part V: Physics Simulation" (https://developer.nvidia.com/gpugems/gpugems3/part-v-physics-simulation). NVIDIA Developer. Retrieved 2020-09-11.
119. "oneAPI Programming Model" (https://www.oneapi.io/). oneAPI.io. Retrieved 2024-07-27.
120. "Specifications | oneAPI" (https://www.oneapi.io/spec/). oneAPI.io. Retrieved 2024-07-27.
121. "oneAPI Specification — oneAPI Specification 1.3-rev-1 documentation" (https://oneapi-spec.uxlfoundation.org/specifications/oneapi/v1.3-rev-1/). oneapi-spec.uxlfoundation.org. Retrieved 2024-07-27.
122. "Exclusive: Behind the plot to break Nvidia's grip on AI by targeting software" (https://www.reuters.com/technology/behind-plot-break-nvidias-grip-ai-by-targeting-software-2024-03-25/). Reuters. Retrieved 2024-04-05.
123. "Question: What does ROCm stand for? · Issue #1628 · RadeonOpenCompute/ROCm" (https://github.com/RadeonOpenCompute/ROCm/issues/1628). Github.com. Retrieved January 18, 2022.

Further reading
Buck, Ian; Foley, Tim; Horn, Daniel; Sugerman, Jeremy; Fatahalian, Kayvon; Houston, Mike; Hanrahan, Pat (2004-08-01). "Brook for GPUs: stream computing on graphics hardware" (https://dl.acm.org/doi/10.1145/1015706.1015800). ACM Transactions on Graphics. 23 (3): 777–786. doi:10.1145/1015706.1015800 (https://doi.org/10.1145%2F1015706.1015800). ISSN 0730-0301 (https://search.worldcat.org/issn/0730-0301).
Nickolls, John; Buck, Ian; Garland, Michael; Skadron, Kevin (2008-03-01). "Scalable Parallel Programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?" (https://dl.acm.org/doi/10.1145/1365490.1365500). Queue. 6 (2): 40–53. doi:10.1145/1365490.1365500 (https://doi.org/10.1145%2F1365490.1365500). ISSN 1542-7730 (https://search.worldcat.org/issn/1542-7730).

External links
Official website (https://developer.nvidia.com/cuda-zone)

Retrieved from "https://en.wikipedia.org/w/index.php?title=CUDA&oldid=1273998829"
