CUDA
Developer(s): Nvidia
Initial release: February 16, 2007[1]
Stable release: 12.8 / January 2025
Operating system: Windows, Linux
Platform: Supported GPUs
Type: GPGPU
License: Proprietary
Website: developer.nvidia.com/cuda-zone (https://developer.nvidia.com/cuda-zone)
In computing, CUDA is a proprietary[2] parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs. CUDA was created by Nvidia in 2006.[3] When it was first introduced, the name was an acronym for Compute Unified Device Architecture,[4] but Nvidia later dropped the common use of the acronym and now rarely expands it.[5]
CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels.[6] In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools to help programmers accelerate their applications.
CUDA is designed to work with programming languages such as C, C++, Fortran and Python. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which require advanced skills in graphics programming.[7] CUDA-powered GPUs also support programming frameworks such as OpenMP, OpenACC and OpenCL.[8][6]
Background
The graphics processing unit (GPU), as a specialized computer processor, addresses the compute-intensive demands of real-time, high-resolution 3D graphics. By 2012, GPUs had evolved into highly parallel multi-core systems allowing efficient manipulation of large blocks of data. This design is more effective than general-purpose central processing units (CPUs) for algorithms in situations where processing large blocks of data is done in parallel, such as:
Ontology
The following table offers a non-exact description of the ontology of the CUDA framework.
GPU L1 cache | local, shared | SM ("streaming multiprocessor") | block | individual subroutine call
Programming abilities
The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives such as OpenACC, and extensions to
industry-standard programming languages including C, C++, Fortran and Python. C/C++ programmers can use 'CUDA C/C++', compiled to PTX with nvcc,
Nvidia's LLVM-based C/C++ compiler, or by clang itself.[10] Fortran programmers can use 'CUDA Fortran', compiled with the PGI CUDA Fortran compiler
from The Portland Group. Python programmers can use the cuNumeric library to accelerate applications on Nvidia GPUs.
In addition to libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA platform supports other computational interfaces, including the
Khronos Group's OpenCL,[11] Microsoft's DirectCompute, OpenGL Compute Shader and C++ AMP.[12] Third-party wrappers are also available for Python,
Perl, Fortran, Java, Ruby, Lua, Common Lisp, Haskell, R, MATLAB, IDL and Julia, and Mathematica has native support.
In the computer game industry, GPUs are used for graphics rendering, and for game physics calculations
(physical effects such as debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also
been used to accelerate non-graphical applications in computational biology, cryptography and other fields
by an order of magnitude or more.[13][14][15][16][17]
CUDA provides both a low level API (CUDA Driver API, non single-source) and a higher level API
(CUDA Runtime API, single-source). The initial CUDA SDK was made public on 15 February 2007, for
Microsoft Windows and Linux. Mac OS X support was later added in version 2.0,[18] which supersedes the
beta released February 14, 2008.[19] CUDA works with all Nvidia GPUs from the G8x series onwards,
including GeForce, Quadro and the Tesla line. CUDA is compatible with most standard operating systems.
CUDA 8.0 comes with the following libraries (for compilation & runtime, in alphabetical order):
Advantages
CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:
Limitations
Whether for the host computer or the GPU device, all CUDA source code is now processed according to C++ syntax rules.[25] This was not
always the case. Earlier versions of CUDA were based on C syntax rules.[26] As with the more general case of compiling C code with a C++
compiler, it is therefore possible that old C-style CUDA source code will either fail to compile or will not behave as originally intended.
Interoperability with rendering languages such as OpenGL is one-way, with OpenGL having access to registered CUDA memory but CUDA
not having access to OpenGL memory.
Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency (this can be partly
alleviated with asynchronous memory transfers, handled by the GPU's DMA engine).
Threads should be running in groups of at least 32 for best performance, with total number of threads numbering in the thousands. Branches
in the program code do not affect performance significantly, provided that each of 32 threads takes the same execution path; the SIMD
execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a space partitioning data structure during
ray tracing).
No emulation or fallback functionality is available for modern revisions.
Valid C++ may sometimes be flagged and prevent compilation due to the way the compiler approaches optimization for target GPU device
limitations.
C++ run-time type information (RTTI) and C++-style exception handling are only supported in host code, not in device code.
In single-precision on first generation CUDA compute capability 1.x devices, denormal numbers are unsupported and are instead flushed to
zero, and the precision of both the division and square root operations are slightly lower than IEEE 754-compliant single precision math.
Devices that support compute capability 2.0 and above support denormal numbers, and the division and square root operations are IEEE 754
compliant by default. However, users can obtain the prior faster gaming-grade math of compute capability 1.x devices if desired by setting
compiler flags to disable accurate divisions and accurate square roots, and enable flushing denormal numbers to zero.[27]
Unlike OpenCL, CUDA-enabled GPUs are only available from Nvidia as it is proprietary.[28][2] Attempts to implement CUDA on other GPUs
include:
Project Coriander: Converts CUDA C++11 source to OpenCL 1.2 C. A fork of CUDA-on-CL intended to run TensorFlow.[29][30][31]
CU2CL: Convert CUDA 3.2 C++ to OpenCL C.[32]
GPUOpen HIP: A thin abstraction layer on top of CUDA and ROCm intended for AMD and Nvidia GPUs. Has a conversion tool for
importing CUDA C++ source. Supports CUDA 4.0 plus C++11 and float16.
ZLUDA is a drop-in replacement for CUDA on AMD GPUs and formerly Intel GPUs with near-native performance.[33] The developer,
Andrzej Janik, was separately contracted by both Intel and AMD to develop the software in 2021 and 2022, respectively. However, neither
company decided to release it officially due to the lack of a business use case. AMD's contract included a clause that allowed Janik to
release his code for AMD independently, allowing him to release the new version that only supports AMD GPUs.[34]
chipStar can compile and run CUDA/HIP programs on advanced OpenCL 3.0 or Level Zero platforms.[35]
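The warp-level limitation listed above (branches are cheap only when all 32 threads of a group take the same execution path) can be illustrated with a small host-side model. The sketch below is plain Python rather than CUDA; the function name `warp_passes` and its cost model (one serialized pass per distinct branch taken inside a warp) are simplifying assumptions for illustration, not a description of Nvidia's actual scheduler.

```python
WARP_SIZE = 32  # threads that execute in lockstep, as stated in the text


def warp_passes(branch_of_thread):
    """Count serialized passes under a simplified SIMD model.

    branch_of_thread[i] is the label of the branch taken by thread i.
    Assumption: each warp needs one pass per distinct branch among its threads.
    """
    total = 0
    for start in range(0, len(branch_of_thread), WARP_SIZE):
        warp = branch_of_thread[start:start + WARP_SIZE]
        total += len(set(warp))
    return total


n_threads = 128  # four warps

# Coherent: the branch depends only on the warp index, so each warp is uniform.
coherent = [(i // WARP_SIZE) % 2 for i in range(n_threads)]
# Divergent: neighbouring threads alternate between two branches.
divergent = [i % 2 for i in range(n_threads)]

print(warp_passes(coherent))   # one pass per warp
print(warp_passes(divergent))  # two passes per warp
```

Under this model the divergent layout needs twice as many passes as the coherent one for the same amount of work, which is the effect the limitation describes for inherently divergent tasks.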
Example
This example code in C++ loads a texture from an image into an array on the GPU:
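#include <cuda_runtime.h>
// Kernel assumed to be defined elsewhere in the same file.
__global__ void kernel(float* odata, int height, int width);
// d_data, width and height are assumed to be supplied by the caller.
void foo(float* d_data, int width, int height)
{
    cudaArray* cu_array;
    // Allocate array
    cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
    cudaMallocArray(&cu_array, &description, width, height);
    // Run kernel: 16x16 threads per block, grid rounded up to cover the image
    dim3 blockDim(16, 16, 1);
    dim3 gridDim((width + blockDim.x - 1) / blockDim.x, (height + blockDim.y - 1) / blockDim.y, 1);
    kernel<<<gridDim, blockDim, 0>>>(d_data, height, width);
    // Release the array
    cudaFreeArray(cu_array);
}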
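The grid dimensions in the launch configuration above come from a rounding-up integer division, so that the blocks cover the whole image even when its size is not a multiple of the block size. The following host-side sketch reproduces that arithmetic in plain Python; the helper names `grid_dim` and `covered_threads` are hypothetical and not part of any CUDA API.

```python
def grid_dim(width, height, block=(16, 16)):
    """Blocks needed per axis to cover a width x height image."""
    bx, by = block
    # Same rounding-up division as (width + blockDim.x - 1) / blockDim.x in C++.
    return ((width + bx - 1) // bx, (height + by - 1) // by)


def covered_threads(width, height, block=(16, 16)):
    """Total threads launched per axis (grid size times block size)."""
    gx, gy = grid_dim(width, height, block)
    return (gx * block[0], gy * block[1])


print(grid_dim(640, 480))           # sizes that are exact multiples of 16
print(grid_dim(1000, 1000))         # sizes that need rounding up
print(covered_threads(1000, 1000))  # more threads than pixels
```

When the image size is not a multiple of the block size, the launch creates surplus threads past the image edge. A kernel would normally compare its coordinates against the width and height before touching memory, which is presumably why those values are passed to the kernel in the example.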
Below is an example given in Python that computes the product of two arrays on the GPU. The unofficial Python language bindings can be obtained from
PyCUDA.[36]
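import numpy
import pycuda.autoinit  # initializes the driver and creates a CUDA context
import pycuda.compiler as comp
import pycuda.driver as drv
# Compile the CUDA C kernel at run time
mod = comp.SourceModule(
    """
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
"""
)
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
# One block of 400 threads; drv.In and drv.Out copy the arrays to and from the GPU
multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1))
print(dest - a * b)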
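To make the execution model of the example above concrete, the sketch below emulates the same single-block launch on the CPU in plain Python. The `launch` helper is a hypothetical stand-in for the kernel launch and simply runs the kernel body once per thread index; on a GPU these invocations would run in parallel rather than in a loop.

```python
import random


def multiply_them(dest, a, b, thread_idx_x):
    """Kernel body for one thread, mirroring the CUDA C source."""
    i = thread_idx_x
    dest[i] = a[i] * b[i]


def launch(kernel, block, *args):
    """Hypothetical one-block launch: call the body once per thread in x."""
    for thread_idx_x in range(block[0]):
        kernel(*args, thread_idx_x)


a = [random.gauss(0.0, 1.0) for _ in range(400)]
b = [random.gauss(0.0, 1.0) for _ in range(400)]
dest = [0.0] * 400

launch(multiply_them, (400, 1, 1), dest, a, b)
print(all(d == x * y for d, x, y in zip(dest, a, b)))
```

Because the example uses a single block, `threadIdx.x` alone identifies each element. A launch spread over several blocks would typically compute the index as `blockIdx.x * blockDim.x + threadIdx.x` instead.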
Additional Python bindings to simplify matrix multiplication operations can be found in the program pycublas.[37]
import numpy
from pycublas import CUBLASMatrix
CuPy provides a NumPy-compatible array interface that runs on the GPU:[38]
import cupy
a = cupy.random.randn(400)
b = cupy.random.randn(400)
dest = cupy.zeros_like(a)
print(dest - a * b)
GPUs supported
Supported CUDA compute capability versions for CUDA SDK version and microarchitecture (by code name):
Note: CUDA SDK 10.2 is the last official release for macOS, as support will not be available for macOS in newer releases.
CUDA compute capability by version with associated GPU semiconductors and GPU card models (separated by their various application areas):
Compute capability, GPU semiconductors and Nvidia GPU board products
Compute capability 1.0 (micro-architecture: Tesla)
GPUs: G80
GeForce: GeForce 8800 Ultra, GeForce 8800 GTX, GeForce 8800 GTS(G80)
Quadro, NVS: Quadro FX 5600, Quadro FX 4600, Quadro Plex 2100 S4
Tesla/Datacenter: Tesla C870, Tesla D870, Tesla S870
Compute capability 1.1 (micro-architecture: Tesla)
GPUs: G92, G94, G96, G98, G84, G86
GeForce: GeForce GTS 250, GeForce 9800 GX2, GeForce 9800 GTX, GeForce 9800 GT, GeForce 8800 GTS(G92), GeForce 8800 GT, GeForce 9600 GT, GeForce 9500 GT, GeForce 9400 GT, GeForce 8600 GTS, GeForce 8600 GT, GeForce 8500 GT, GeForce G110M, GeForce 9300M GS, GeForce 9200M GS, GeForce 9100M G, GeForce 8400M GT, GeForce G105M
Quadro, NVS: Quadro FX 4700 X2, Quadro FX 3700, Quadro FX 1800, Quadro FX 1700, Quadro FX 580, Quadro FX 570, Quadro FX 470, Quadro FX 380, Quadro FX 370, Quadro FX 370 Low Profile, Quadro NVS 450, Quadro NVS 420, Quadro NVS 290, Quadro NVS 295, Quadro Plex 2100 D4, Quadro FX 3800M, Quadro FX 3700M, Quadro FX 3600M, Quadro FX 2800M, Quadro FX 2700M, Quadro FX 1700M, Quadro FX 1600M, Quadro FX 770M, Quadro FX 570M, Quadro FX 370M, Quadro FX 360M, Quadro NVS 320M, Quadro NVS 160M, Quadro NVS 150M, Quadro NVS 140M, Quadro NVS 135M, Quadro NVS 130M, Quadro NVS 450, Quadro NVS 420,[51] Quadro NVS 295
Compute capability 1.2 (micro-architecture: Tesla)
GPUs: GT218, GT216, GT215
GeForce: GeForce GT 340*, GeForce GT 330*, GeForce GT 320*, GeForce 315*, GeForce 310*, GeForce GT 240, GeForce GT 220, GeForce 210, GeForce GTS 360M, GeForce GTS 350M, GeForce GT 335M, GeForce GT 330M, GeForce GT 325M, GeForce GT 240M, GeForce G210M, GeForce 310M, GeForce 305M
Quadro, NVS: Quadro FX 380 Low Profile, Quadro FX 1800M, Quadro FX 880M, Quadro FX 380M, Nvidia NVS 300, NVS 5100M, NVS 3100M, NVS 2100M, ION
Compute capability 1.3 (micro-architecture: Tesla)
GPUs: GT200, GT200b
GeForce: GeForce GTX 295, GTX 285, GTX 280, GeForce GTX 275, GeForce GTX 260
Quadro, NVS: Quadro FX 5800, Quadro FX 4800, Quadro FX 4800 for Mac, Quadro FX 3800, Quadro CX, Quadro Plex 2200 D2
Tesla/Datacenter: Tesla C1060, Tesla S1070, Tesla M1060
Compute capability 2.0 (micro-architecture: Fermi)
GPUs: GF100, GF110
GeForce: GeForce GTX 590, GeForce GTX 580, GeForce GTX 570, GeForce GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 480M
Quadro, NVS: Quadro 6000, Quadro 5000, Quadro 4000, Quadro 4000 for Mac, Quadro Plex 7000, Quadro 5010M, Quadro 5000M
Tesla/Datacenter: Tesla C2075, Tesla C2050/C2070, Tesla M2050/M2070/M2075/M2090
Compute capability 2.1 (micro-architecture: Fermi)
GPUs: GF104, GF106, GF108, GF114, GF116, GF117, GF119
GeForce: GeForce GTX 560 Ti, GeForce GTX 550 Ti, GeForce GTX 460, GeForce GTS 450, GeForce GTS 450*, GeForce GT 640 (GDDR3), GeForce GT 630, GeForce GT 620, GeForce GT 610, GeForce GT 520, GeForce GT 440, GeForce GT 440*, GeForce GT 430, GeForce GT 430*, GeForce GT 420*, GeForce GTX 675M, GeForce GTX 670M, GeForce GT 635M, GeForce GT 630M, GeForce GT 625M, GeForce GT 720M, GeForce GT 620M, GeForce 710M, GeForce 610M, GeForce 820M, GeForce GTX 580M, GeForce GTX 570M, GeForce GTX 560M, GeForce GT 555M, GeForce GT 550M, GeForce GT 540M, GeForce GT 525M, GeForce GT 520MX, GeForce GT 520M, GeForce GTX 485M, GeForce GTX 470M, GeForce GTX 460M, GeForce GT 445M, GeForce GT 435M, GeForce GT 420M, GeForce GT 415M, GeForce 710M, GeForce 410M
Quadro, NVS: Quadro 2000, Quadro 2000D, Quadro 600, Quadro 4000M, Quadro 3000M, Quadro 2000M, Quadro 1000M, NVS 310, NVS 315, NVS 5400M, NVS 5200M, NVS 4200M
Compute capability 3.0 (micro-architecture: Kepler)
GPUs: GK104, GK106, GK107
GeForce: GeForce GTX 770, GeForce GTX 760, GeForce GT 740, GeForce GTX 690, GeForce GTX 680, GeForce GTX 670, GeForce GTX 660 Ti, GeForce GTX 660, GeForce GTX 650 Ti BOOST, GeForce GTX 650 Ti, GeForce GTX 650, GeForce GTX 880M, GeForce GTX 870M, GeForce GTX 780M, GeForce GTX 770M, GeForce GTX 765M, GeForce GTX 760M, GeForce GTX 680MX, GeForce GTX 680M, GeForce GTX 675MX, GeForce GTX 670MX, GeForce GTX 660M, GeForce GT 750M, GeForce GT 650M, GeForce GT 745M, GeForce GT 645M, GeForce GT 740M, GeForce GT 730M, GeForce GT 640M, GeForce GT 640M LE, GeForce GT 735M, GeForce GT 730M
Quadro, NVS: Quadro K5000, Quadro K4200, Quadro K4000, Quadro K2000, Quadro K2000D, Quadro K600, Quadro K420, Quadro K500M, Quadro K510M, Quadro K610M, Quadro K1000M, Quadro K2000M, Quadro K1100M, Quadro K2100M, Quadro K3000M, Quadro K3100M, Quadro K4000M, Quadro K5000M, Quadro K4100M, Quadro K5100M, NVS 510, Quadro 410
Tesla/Datacenter: Tesla K10, GRID K340, GRID K520, GRID K2
Compute capability 3.2 (micro-architecture: Kepler)
GPUs: GK20A
Tegra, Jetson, DRIVE: Tegra K1, Jetson TK1
Compute capability 3.5 (micro-architecture: Kepler)
GPUs: GK110, GK208
GeForce: GeForce GTX Titan Z, GeForce GTX Titan Black, GeForce GTX Titan, GeForce GTX 780 Ti, GeForce GTX 780, GeForce GT 640 (GDDR5), GeForce GT 630 v2, GeForce GT 730, GeForce GT 720, GeForce GT 710, GeForce GT 740M (64-bit, DDR3), GeForce GT 920M
Quadro, NVS: Quadro K6000, Quadro K5200
Tesla/Datacenter: Tesla K40, Tesla K20x, Tesla K20
Compute capability 3.7 (micro-architecture: Kepler)
GPUs: GK210
Tesla/Datacenter: Tesla K80
Compute capability 5.0 (micro-architecture: Maxwell)
GPUs: GM107, GM108
GeForce: GeForce GTX 750 Ti, GeForce GTX 750, GeForce GTX 960M, GeForce GTX 950M, GeForce 940M, GeForce 930M, GeForce GTX 860M, GeForce GTX 850M, GeForce 845M, GeForce 840M, GeForce 830M
Quadro, NVS: Quadro K1200, Quadro K2200, Quadro K620, Quadro M2000M, Quadro M1000M, Quadro M600M, Quadro K620M, NVS 810
Tesla/Datacenter: Tesla M10
Compute capability 5.2 (micro-architecture: Maxwell)
GPUs: GM200, GM204, GM206
GeForce: GeForce GTX Titan X, GeForce GTX 980 Ti, GeForce GTX 980, GeForce GTX 970, GeForce GTX 960, GeForce GTX 950, GeForce GTX 750 SE, GeForce GTX 980M, GeForce GTX 970M, GeForce GTX 965M
Quadro, NVS: Quadro M6000 24GB, Quadro M6000, Quadro M5000, Quadro M4000, Quadro M2000, Quadro M5500, Quadro M5000M, Quadro M4000M, Quadro M3000M
Tesla/Datacenter: Tesla M4, Tesla M40, Tesla M6, Tesla M60
Compute capability 5.3 (micro-architecture: Maxwell)
GPUs: GM20B
Tegra, Jetson, DRIVE: Tegra X1, Jetson TX1, Jetson Nano, DRIVE CX, DRIVE PX
Compute capability 6.0 (micro-architecture: Pascal)
GPUs: GP100
Quadro, NVS: Quadro GP100
Tesla/Datacenter: Tesla P100
Compute capability 6.1 (micro-architecture: Pascal)
GPUs: GP102, GP104, GP106, GP107, GP108
GeForce: Nvidia TITAN Xp, Titan X, GeForce GTX 1080 Ti, GTX 1080, GTX 1070 Ti, GTX 1070, GTX 1060, GTX 1050 Ti, GTX 1050, GT 1030, GT 1010, MX350, MX330, MX250, MX230, MX150, MX130, MX110
Quadro, NVS: Quadro P6000, Quadro P5000, Quadro P4000, Quadro P2200, Quadro P2000, Quadro P1000, Quadro P400, Quadro P500, Quadro P520, Quadro P600, Quadro P5000 (mobile), Quadro P4000 (mobile), Quadro P3000 (mobile)
Tesla/Datacenter: Tesla P40, Tesla P6, Tesla P4
Compute capability 6.2 (micro-architecture: Pascal)
GPUs: GP10B[52]
Tegra, Jetson, DRIVE: Tegra X2, Jetson TX2, DRIVE PX 2
Compute capability 7.0 (micro-architecture: Volta)
GPUs: GV100
GeForce: NVIDIA TITAN V
Quadro, NVS: Quadro GV100
Tesla/Datacenter: Tesla V100, Tesla V100S
Compute capability 7.2 (micro-architecture: Volta)
GPUs: GV10B[53], GV11B[54][55]
Tegra, Jetson, DRIVE: Tegra Xavier, Jetson Xavier NX, Jetson AGX Xavier, DRIVE AGX Xavier, DRIVE AGX Pegasus, Clara AGX
Compute capability 7.5 (micro-architecture: Turing)
GPUs: TU102, TU104, TU106, TU116, TU117
GeForce: NVIDIA TITAN RTX, GeForce RTX 2080 Ti, RTX 2080 Super, RTX 2080, RTX 2070 Super, RTX 2070, RTX 2060 Super, RTX 2060 12GB, RTX 2060, GeForce GTX 1660 Ti, GTX 1660 Super, GTX 1660, GTX 1650 Super, GTX 1650, MX550, MX450
Quadro, NVS: Quadro RTX 8000, Quadro RTX 6000, Quadro RTX 5000, Quadro RTX 4000, T1000, T600, T400, T1200 (mobile), T600 (mobile), T500 (mobile), Quadro T2000 (mobile), Quadro T1000 (mobile)
Tesla/Datacenter: Tesla T4
Compute capability 8.0 (micro-architecture: Ampere)
GPUs: GA100
Tesla/Datacenter: A100 80GB, A100 40GB, A30
Compute capability 8.6 (micro-architecture: Ampere)
GPUs: GA102, GA103, GA104, GA106, GA107
GeForce: GeForce RTX 3090 Ti, RTX 3090, RTX 3080 Ti, RTX 3080 12GB, RTX 3080, RTX 3070 Ti, RTX 3070, RTX 3060 Ti, RTX 3060, RTX 3050, RTX 3050 Ti (mobile), RTX 3050 (mobile), RTX 2050 (mobile), MX570
Quadro, NVS: RTX A6000, RTX A5500, RTX A5000, RTX A4500, RTX A4000, RTX A2000, RTX A5000 (mobile), RTX A4000 (mobile), RTX A3000 (mobile), RTX A2000 (mobile)
Tesla/Datacenter: A40, A16, A10, A2
Compute capability 8.7 (micro-architecture: Ampere)
GPUs: GA10B
Tegra, Jetson, DRIVE: Jetson Orin Nano, Jetson Orin NX, Jetson AGX Orin, DRIVE AGX Orin, DRIVE AGX Pegasus OA, Clara Holoscan
Compute capability 8.9 (micro-architecture: Ada Lovelace[56])
GPUs: AD102, AD103, AD104, AD106, AD107
GeForce: GeForce RTX 4090, RTX 4080 Super, RTX 4080, RTX 4070 Ti Super, RTX 4070 Ti, RTX 4070 Super, RTX 4070, RTX 4060 Ti, RTX 4060, RTX 4050 (mobile)
Quadro, NVS: RTX 6000 Ada, RTX 5880 Ada, RTX 5000 Ada, RTX 4500 Ada, RTX 4000 Ada, RTX 4000 SFF, RTX 3500 Ada (mobile)
Tesla/Datacenter: L40S, L40, L20, L4, L2
Compute capability 9.0 (micro-architecture: Hopper)
GPUs: GH100
Tesla/Datacenter: H200, H100
Compute capability 10.0 (micro-architecture: Blackwell)
GPUs: GB100
Tesla/Datacenter: B200, B100, GB200 (?)
Compute capability 10.1 (micro-architecture: Blackwell)
GPUs: G10 (?)
Tesla/Datacenter: GB10 (?)
Compute capability 12.0 (micro-architecture: Blackwell)
GPUs: GB202, GB203, GB205, GB206, GB207
GeForce: GeForce RTX 5090, RTX 5080, RTX 5070 Ti, RTX 5070
Tesla/Datacenter: B40
Compute capability 12.x (?) (micro-architecture: Blackwell)
Tegra, Jetson, DRIVE: Jetson Thor (?), AGX Thor (?), Drive Thor (?)
* – OEM-only products
Feature support by compute capability (version)[58] (unlisted features are supported for all compute capabilities)
Compute capability column groups: 1.0, 1.1 | 1.2, 1.3 | 2.x | 3.0 | 3.2 | 3.5, 3.7, 5.x, 6.x, 7.0, 7.2 | 7.5 | 8.x | 9.0, 10.x, 12.0
Hardware-accelerated async-copy; hardware-accelerated split arrive/wait barrier; warp-level support for reduction ops; L2 cache residency management: No before 8.x, Yes from 8.x
DPX instructions for accelerated dynamic programming; distributed shared memory; thread block cluster; Tensor memory accelerator (TMA) unit: No before 9.0, Yes from 9.0
Data types
Floating-point types
Data type | Supported vector types | Storage length bits (complete vector) | Used length bits (single value) | Sign bits | Exponent bits | Mantissa bits | Comments
E2M1 = FP4 | e2m1x2 / e2m1x4 | 8 / 16 | 4 | 1 | 2 | 1 |
E2M3 = FP6 variant | e2m3x2 / e2m3x4 | 16 / 32 | 6 | 1 | 2 | 3 |
E3M2 = FP6 variant | e3m2x2 / e3m2x4 | 16 / 32 | 6 | 1 | 3 | 2 |
UE4M3 | ue4m3 | 8 | 7 | 0 | 4 | 3 | Used for scaling (E2M1 only)
E4M3 = FP8 variant | e4m3 / e4m3x2 / e4m3x4 | 8 / 16 / 32 | 8 | 1 | 4 | 3 |
E5M2 = FP8 variant | e5m2 / e5m2x2 / e5m2x4 | 8 / 16 / 32 | 8 | 1 | 5 | 2 | Exponent/range of FP16, fits into 8 bits
UE8M0 | ue8m0x2 | 16 | 8 | 0 | 8 | 0 | Used for scaling (any FP4 or FP6 or FP8 format)
FP16 | f16 / f16x2 | 16 / 32 | 16 | 1 | 5 | 10 |
BF16 | bf16 / bf16x2 | 16 / 32 | 16 | 1 | 8 | 7 | Exponent/range of FP32, fits into 16 bits
TF32 | tf32 | 32 | 19 | 1 | 8 | 10 | Exponent/range of FP32, mantissa/precision of FP16
FP32 | f32 / f32x2 | 32 / 64 | 32 | 1 | 8 | 23 |
FP64 | f64 | 64 | 64 | 1 | 11 | 52 |
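The sign, exponent and mantissa widths in the table above determine how a bit pattern maps to a value. The sketch below decodes patterns for the 16-bit and wider formats in plain Python. It assumes IEEE 754 conventions (exponent bias of 2^(e-1) - 1, subnormals, and an all-ones exponent meaning infinity or NaN); the names `LAYOUTS`, `used_bits` and `decode` are illustrative only. The narrower FP4, FP6 and FP8 formats and the scaling formats are left out because the table does not state their special-value conventions.

```python
import struct

# (exponent bits, mantissa bits) copied from the table; one sign bit each.
LAYOUTS = {
    "FP16": (5, 10),
    "BF16": (8, 7),
    "TF32": (8, 10),
    "FP32": (8, 23),
    "FP64": (11, 52),
}


def used_bits(name):
    """Sign + exponent + mantissa, i.e. the 'used length' column."""
    exp_bits, man_bits = LAYOUTS[name]
    return 1 + exp_bits + man_bits


def decode(bits, exp_bits, man_bits):
    """Decode a bit pattern, assuming IEEE 754-style encoding."""
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exp == (1 << exp_bits) - 1:   # all-ones exponent: infinity or NaN
        return sign * float("inf") if man == 0 else float("nan")
    if exp == 0:                     # zero or subnormal
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


print({name: used_bits(name) for name in LAYOUTS})
# Cross-check one FP16 pattern against the standard library's half-float codec.
print(decode(0x3E00, *LAYOUTS["FP16"]),
      struct.unpack("<e", struct.pack("<H", 0x3E00))[0])
# The same value in BF16 uses a different pattern because of the wider exponent.
print(decode(0x3FC0, *LAYOUTS["BF16"]))
```

BF16 and TF32 share the 8-bit exponent of FP32, so they cover the same range with fewer mantissa bits, which matches the comments column of the table.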
Version support
Tensor cores
FMA per
cycle per 7.5 7.5 8.6 8.6 8.9 8.9
tensor Supported since 7.0 7.2 8.0 8.7 9.0
Workstation Desktop Workstation Desktop Desktop Workstation
core[60]
1st 1st
For dense For sparse
Data Type Gen Gen? 2nd Gen (8x/SM) 3rd Gen (4x/SM) 4th Gen (4x/SM)
matrices matrices
(8x/SM) (8x/SM)
1-bit values 8.0 as
No
(AND) experimental
No 4096 2048
1-bit values
1024
(XOR) 7.5–8.9 as
No Deprec
4-bit experimental 8.0–8.9 as
256 1024 512
integers experimental
4-bit
floating
10.0 No 4
point FP4
(E2M1)
6-bit
floating
point FP6 10.0 No 2
(E3M2 and
E2M3)
8-bit
7.2 8.0 No 128 128 512 256
integers
8-bit
floating
point FP8
(E4M3 and 256
E5M2) with
FP16
1024 2
accumulate
8.9 No
8-bit
floating
point FP8
(E4M3 and 128
E5M2) with
FP32
accumulate
16-bit
floating
point FP16 64 128
with FP16
accumulate
7.0 8.0 64 64
16-bit
floating
point FP16 256 512 1
with FP32
accumulate
64 128
16-bit
floating
32
point BF16 64[62]
with FP32
accumulate
7.5[61] 8.0
32-bit (19
No speed tbd
bits used)
128 32 64 256
floating (32?)[62]
point TF32
64-bit
floating 8.0 No No 16 speed tbd 32
point
Note: Any missing lines or empty entries do reflect some lack of information on that exact item.[63][64] [65] [66] [67] [68]
Tensor Core Composition 7.0 7.2, 7.5 8.0, 8.6 8.7 8.9 9.0
[69][70][71][72] 4 (8) 8 (16) 4 (8) 16 (32)
Dot Product Unit Width in FP16 units (in bytes)
FP Tensor Cores: Minimum Matrix Shape for full throughput (Bytes)[75] 2048
Technical specification
Technical 1.0 1.1 1.2 1.3 2.x 3.0 3.2 3.5 3.7 5.0 5.2 5.3 6.0 6.1 6.2 7.0 7.2 7.5
specifications Compute capability (version)
[86]
[87]
Multiprocessor architecture
Number of
special
function units
for single-
precision 2[96] 4 8 32 16
floating-point
transcendental
functions
Number of 4 8 8 per 2
texture per per 8 per 3
mapping units 2 2
/
SM
4 4 / 8[94] 16 8 16 8
(TMU) SM SM 3SM[94]
Number of
ALU lanes for
uniform INT32 No
arithmetic
operations
Number of
No
tensor cores
Number of
raytracing No
cores
Number of SM
Partitions =
Processing 1 4 2
Blocks[99]
Number of
warp
1 2 4
schedulers per
SM partition
Max number of
new
instructions
issued each 2[101] 1 2[102] 2
cycle by a
single
scheduler[100]
Size of unified
memory for 64 KiB SM + 96 KiB SM + 64 KiB SM + 64 KiB SM +
16 128
data cache 16 KiB[103] 64 KiB 24 KiB L1 24 KiB L1 24 KiB L1 24 KiB L1
KiB[103] KiB
and shared (separate)[104] (separate)[104] (separate)[104] (separate)[104]
memory
Size of L3
instruction 32
cache per KiB[106]
GPU
Size of L2 use L
instruction
cache per
8 KiB
Texture
Processor
Cluster (TPC)
Size of L1.5
instruction 32
cache per KiB
32 KiB 48 KiB[84] 128 KiB
SM[107] 4 KiB
Size of L1
instruction 8 KiB 8K
cache per SM
Size of L0
instruction
only 1 partition per SM No
cache per SM
partition
Number of
Render Output
Units (ROP)
per memory 4 8 4 8 16 8 12
partition (or
per GPC in
later models)
Architecture 1.0 1.1 1.2 1.3 2.0 2.1 3.0 3.2 3.5 3.7 5.0 5.2 5.3 6.0
specifications Compute capability (version
[115]
Whereas Nvidia's CUDA is closed-source, Intel's oneAPI and AMD's ROCm are open source.
Intel oneAPI
oneAPI is an initiative based on open standards, created to support software development for multiple hardware architectures.[119] The oneAPI libraries must implement open specifications that are discussed publicly by the Special Interest Groups, offering the possibility for any developer or organization to implement their own versions of oneAPI libraries.[120][121]
oneAPI was originally made by Intel; other hardware adopters include Fujitsu and Huawei.
AMD ROCm
ROCm[123] is an open source software stack for graphics processing unit (GPU) programming from Advanced Micro Devices (AMD).
See also
SYCL – an open standard from Khronos Group for programming a variety of platforms, including GPUs, with single-source modern C++,
similar to higher-level CUDA Runtime API (single-source)
BrookGPU – the Stanford University graphics group's compiler
Array programming
Parallel computing
Stream processing
rCUDA – an API for computing on remote computers
Molecular modeling on GPUs
Vulkan – low-level, high-performance 3D graphics and computing API
OptiX – ray tracing API by NVIDIA
CUDA binary (cubin) – a type of fat binary
Numerical Library Collection – by NEC for their vector processor
References
1. "NVIDIA® CUDA™ Unleashes Power of GPU Computing - Press Release" (https://web.archive.org/web/20070329144655/http://www.nvidia.com/object/IO_39918.html). nvidia.com. Archived from the original (http://www.nvidia.com/object/IO_39918.html) on 29 March 2007. Retrieved 26 January 2025.
2. Shah, Agam. "Nvidia not totally against third parties making CUDA chips" (https://www.theregister.com/2021/11/10/nvidia_cuda_silicon/). www.theregister.com. Retrieved 2024-04-25.
3. "Nvidia CUDA Home Page" (https://developer.nvidia.com/cuda-zone). 18 July 2017.
4. Shimpi, Anand Lal; Wilson, Derek (November 8, 2006). "Nvidia's GeForce 8800 (G80): GPUs Re-architected for DirectX 10" (https://www.anandtech.com/show/2116/8). AnandTech. Retrieved May 16, 2015.
5. "Introduction — nsight-visual-studio-edition 12.6 documentation" (https://docs.nvidia.com/nsight-visual-studio-edition/introduction/index.html#cuda-debugger). docs.nvidia.com. Retrieved 2024-10-10.
6. Abi-Chahla, Fedy (June 18, 2008). "Nvidia's CUDA: The End of the CPU?" (https://www.tomshardware.com/reviews/nvidia-cuda-gpu,1954.html). Tom's Hardware. Retrieved May 17, 2015.
7. Zunitch, Peter (2018-01-24). "CUDA vs. OpenCL vs. OpenGL" (https://www.videomaker.com/article/c15/19313-cuda-vs-opencl-vs-opengl). Videomaker. Retrieved 2018-09-16.
8. "OpenCL" (https://developer.nvidia.com/opencl). NVIDIA Developer. 2013-04-24. Retrieved 2019-11-04.
9. Witt, Stephen (2023-11-27). "How Jensen Huang's Nvidia Is Powering the A.I. Revolution" (https://www.newyorker.com/magazine/2023/12/04/how-jensen-huangs-nvidia-is-powering-the-ai-revolution). The New Yorker. ISSN 0028-792X (https://search.worldcat.org/issn/0028-792X). Retrieved 2023-12-10.
10. "CUDA LLVM Compiler" (https://developer.nvidia.com/cuda-llvm-compiler). 7 May 2012.
11. First OpenCL demo on a GPU (https://www.youtube.com/watch?v=r1sN1ELJfNo) on YouTube
12. DirectCompute Ocean Demo Running on Nvidia CUDA-enabled GPU (https://www.youtube.com/watch?v=K1I4kts5mqc) on YouTube
13. Vasiliadis, Giorgos; Antonatos, Spiros; Polychronakis, Michalis; Markatos, Evangelos P.; Ioannidis, Sotiris (September 2008). "Gnort: High Performance Network Intrusion Detection Using Graphics Processors" (http://www.ics.forth.gr/dcs/Activities/papers/gnort.raid08.pdf) (PDF). Recent Advances in Intrusion Detection. Lecture Notes in Computer Science. Vol. 5230. pp. 116–134. doi:10.1007/978-3-540-87403-4_7 (https://doi.org/10.1007%2F978-3-540-87403-4_7). ISBN 978-3-540-87402-7.
14. Schatz, Michael C.; Trapnell, Cole; Delcher, Arthur L.; Varshney, Amitabh (2007). "High-throughput sequence alignment using Graphics Processing Units" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2222658). BMC Bioinformatics. 8: 474. doi:10.1186/1471-2105-8-474 (https://doi.org/10.1186%2F1471-2105-8-474). PMC 2222658 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2222658). PMID 18070356 (https://pubmed.ncbi.nlm.nih.gov/18070356).
15. Manavski, Svetlin A.; Giorgio, Valle (2008). "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2323659). BMC Bioinformatics. 10 (Suppl 2): S10. doi:10.1186/1471-2105-9-S2-S10 (https://doi.org/10.1186%2F1471-2105-9-S2-S10). PMC 2323659 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2323659). PMID 18387198 (https://pubmed.ncbi.nlm.nih.gov/18387198).
16. "Pyrit – Google Code" (https://code.google.com/p/pyrit/).
17. "Use your Nvidia GPU for scientific computing" (https://web.archive.org/web/20081228022142/http://boinc.berkeley.edu/cuda.php). BOINC. 2008-12-18. Archived from the original (http://boinc.berkeley.edu/cuda.php) on 2008-12-28. Retrieved 2017-08-08.
18. "Nvidia CUDA Software Development Kit (CUDA SDK) – Release Notes Version 2.0 for MAC OS X" (https://web.archive.org/web/20090106020401/http://developer.download.nvidia.com/compute/cuda/sdk/website/doc/CUDA_SDK_release_notes_macosx.txt). Archived from the original (http://developer.download.nvidia.com/compute/cuda/sdk/website/doc/CUDA_SDK_release_notes_macosx.txt) on 2009-01-06.
19. "CUDA 1.1 – Now on Mac OS X" (https://web.archive.org/web/20081122105633/http://news.developer.nvidia.com/2008/02/cuda-11---now-o.html). February 14, 2008. Archived from the original (http://news.developer.nvidia.com/2008/02/cuda-11---now-o.html) on November 22, 2008.
20. "CUDA 11 Features Revealed" (https://developer.nvidia.com/blog/cuda-11-features-revealed/). 14 May 2020.
21. "CUDA Toolkit 11.1 Introduces Support for GeForce RTX 30 Series and Quadro RTX Series GPUs" (https://developer.nvidia.com/blog/cuda-11-1-introduces-support-rtx-30-series/). 23 September 2020.
22. "Enhancing Memory Allocation with New NVIDIA CUDA 11.2 Features" (https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/). 16 December 2020.
23. "Exploring the New Features of CUDA 11.3" (https://developer.nvidia.com/blog/exploring-the-new-features-of-cuda-11-3/). 16 April 2021.
24. Silberstein, Mark; Schuster, Assaf; Geiger, Dan; Patney, Anjul; Owens, John D. (2008). "Efficient computation of sum-products on GPUs through software-managed cache" (https://escholarship.org/content/qt8js4v3f7/qt8js4v3f7.pdf?t=ptt3te) (PDF). Proceedings of the 22nd annual international conference on Supercomputing – ICS '08 (https://escholarship.org/content/qt8js4v3f7/qt8js4v3f7.pdf?t=ptt3te) (PDF). Proceedings of the 22nd annual international conference on Supercomputing – ICS '08. pp. 309–318. doi:10.1145/1375527.1375572 (https://doi.org/10.1145%2F1375527.1375572). ISBN 978-1-60558-158-3.
25. "CUDA C Programming Guide v8.0" (http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf) (PDF). nVidia Developer Zone. January 2017. p. 19. Retrieved 22 March 2017.
26. "NVCC forces c++ compilation of .cu files" (https://devtalk.nvidia.com/default/topic/508479/cuda-programming-and-performance/nvcc-forces-c-compilation-of-cu-files/#entry1340190). 29 November 2011.
36. "PyCUDA" (http://mathema.tician.de/software/pycuda).
37. "pycublas" (https://web.archive.org/web/20090420124748/http://kered.org/blog/2009-04-13/easy-python-numpy-cuda-cublas/). Archived from the original (http://kered.org/blog/2009-04-13/easy-python-numpy-cuda-cublas/) on 2009-04-20. Retrieved 2017-08-08.
38. "CuPy" (https://cupy.dev/). Retrieved 2020-01-08.
39. "NVIDIA CUDA Programming Guide. Version 1.0" (http://developer.download.nvidia.com/compute/cuda/1.0/NVIDIA_CUDA_Programming_Guide_1.0.pdf) (PDF). June 23, 2007.
40. "NVIDIA CUDA Programming Guide. Version 2.1" (http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf) (PDF). December 8, 2008.
41. "NVIDIA CUDA Programming Guide. Version 2.2" (http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf) (PDF). April 2, 2009.
42. "NVIDIA CUDA Programming Guide. Version 2.2.1" (http://developer.download.nvidia.com/compute/cuda/2_21/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.1.pdf) (PDF). May 26, 2009.
43. "NVIDIA CUDA Programming Guide. Version 2.3.1" (http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf) (PDF). August 26, 2009.
44. "NVIDIA CUDA Programming Guide. Version 3.0" (http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_ProgrammingGuide.pdf) (PDF). February 20, 2010.
45. "NVIDIA CUDA C Programming Guide. Version 3.1.1" (http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf) (PDF). July 21, 2010.
46. "NVIDIA CUDA C Programming Guide. Version 3.2" (http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf) (PDF). November 9, 2010.
47. "CUDA 11.0 Release Notes" (https://docs.nvidia.com/cuda/archive/11.0/cuda-toolkit-release-notes/index.html). NVIDIA Developer.
48. "CUDA 11.1 Release Notes" (https://docs.nvidia.com/cuda/archive/11.1.0/cuda-toolkit-release-notes/index.html). NVIDIA Developer.
49. "CUDA 11.5 Release Notes" (https://docs.nvidia.com/cuda/archive/11.5.0/cuda-toolkit-release-notes/index.html). NVIDIA Developer.
50. "CUDA 11.8 Release Notes" (https://docs.nvidia.com/cuda/archive/11.8.0/cuda-toolkit-release-notes/index.html). NVIDIA Developer.
51. "NVIDIA Quadro NVS 420 Specs" (https://www.techpowerup.com/gpu-specs/quadro-nvs-420.c1448). TechPowerUp GPU Database. 25 August 2023.
52. Larabel, Michael (March 29, 2017). "NVIDIA Rolls Out Tegra X2 GPU Support In Nouveau" (http://www.phoronix.com/scan.php?page=news_item&px=Tegra-X2-Nouveau-Support). Phoronix. Retrieved August 8, 2017.
53. Nvidia Xavier Specs (https://www.techpowerup.com/gpudb/3232/xavier) on TechPowerUp (preliminary)
54. "Welcome — Jetson LinuxDeveloper Guide 34.1 documentation" (https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/power_management_jetson_xavier.html).
27. Whitehead, Nathan; Fit-Florea, Alex. "Precision & Performance:
55. "NVIDIA Bringing up Open-Source Volta GPU Support for Their
Floating Point and IEEE 754 Compliance for Nvidia GPUs" (https://
Xavier SoC" (https://www.phoronix.com/scan.php?page=news_item
developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-C
&px=NVIDIA-Nouveau-GV11B-Volta-Xav).
UDA-Floating-Point.pdf) (PDF). Nvidia. Retrieved November 18,
2014. 56. "NVIDIA Ada Lovelace Architecture" (https://www.nvidia.com/en-us/
geforce/ada-lovelace-architecture/).
28. "CUDA-Enabled Products" (https://www.nvidia.com/object/cuda_lea
rn_products.html). CUDA Zone. Nvidia Corporation. Retrieved 57. Dissecting the Turing GPU Architecture through Microbenchmarking
2008-11-03. (https://developer.download.nvidia.com/video/gputechconf/gtc/201
9/presentation/s9839-discovering-the-turing-t4-gpu-architecture-wit
29. "Coriander Project: Compile CUDA Codes To OpenCL, Run
h-microbenchmarks.pdf)
Everywhere" (http://www.phoronix.com/scan.php?page=news_item
&px=CUDA-On-CL-Coriander). Phoronix. 58. "H.1. Features and Technical Specifications – Table 13. Feature
30. Perkins, Hugh (2017). "cuda-on-cl" (http://www.iwocl.org/wp-conten Support per Compute Capability" (https://docs.nvidia.com/cuda/cud
a-c-programming-guide/index.html#features-and-technical-specifica
t/uploads/iwocl2017-hugh-perkins-cuda-cl.pdf) (PDF). IWOCL.
tions). docs.nvidia.com. Retrieved 2020-09-23.
Retrieved August 8, 2017.
59. "CUDA C++ Programming Guide" (https://docs.nvidia.com/cuda/cud
31. "hughperkins/coriander: Build NVIDIA® CUDA™ code for
a-c-programming-guide/index.html#features-and-technical-specifica
OpenCL™ 1.2 devices" (https://github.com/hughperkins/coriander).
tions).
GitHub. May 6, 2019.
60. Fused-Multiply-Add, actually executed, Dense Matrix
32. "CU2CL Documentation" (http://chrec.cs.vt.edu/cu2cl/documentatio
n.php). chrec.cs.vt.edu. 61. as SASS since 7.5, as PTX since 8.0
33. "GitHub – vosen/ZLUDA" (https://github.com/vosen/ZLUDA). 62. unofficial support in SASS
GitHub. 63. "Technical brief. NVIDIA Jetson AGX Orin Series" (https://www.nvidi
34. Larabel, Michael (2024-02-12), "AMD Quietly Funded A Drop-In a.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-
CUDA Implementation Built On ROCm: It's Now Open-Source" (http agx-orin-technical-brief.pdf) (PDF). nvidia.com. Retrieved
s://www.phoronix.com/review/radeon-cuda-zluda), Phoronix, 5 September 2023.
retrieved 2024-02-12
35. "GitHub – chip-spv/chipStar" (https://github.com/chip-spv/chipStar).
GitHub.
64. "NVIDIA Ampere GA102 GPU Architecture" (https://images.nvidia.com/aem-dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf) (PDF). nvidia.com. Retrieved 5 September 2023.
65. Luo, Weile; Fan, Ruibo; Li, Zeyu; Du, Dayou; Wang, Qiang; Chu, Xiaowen (2024). "Benchmarking and Dissecting the Nvidia Hopper GPU Architecture". arXiv:2402.13499v1 (https://arxiv.org/abs/2402.13499v1) [cs.AR (https://arxiv.org/archive/cs.AR)].
66. "Datasheet NVIDIA A40" (https://images.nvidia.com/content/Solutions/data-center/a40/nvidia-a40-datasheet.pdf) (PDF). nvidia.com. Retrieved 27 April 2024.
67. "NVIDIA AMPERE GA102 GPU ARCHITECTURE" (https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.1.pdf) (PDF). 27 April 2024.
68. "Datasheet NVIDIA L40" (https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/support-guide/NVIDIA-L40-Datasheet-January-2023.pdf) (PDF). 27 April 2024.
69. In the Whitepapers the Tensor Core cube diagrams represent the Dot Product Unit Width into the height (4 FP16 for Volta and Turing, 8 FP16 for A100, 4 FP16 for GA102, 16 FP16 for GH100). The other two dimensions represent the number of Dot Product Units (4x4 = 16 for Volta and Turing, 8x4 = 32 for Ampere and Hopper). The resulting gray blocks are the FP16 FMA operations per cycle. Pascal without Tensor core is only shown for speed comparison as is Volta V100 with non-FP16 datatypes.
70. "NVIDIA Turing Architecture Whitepaper" (https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf) (PDF). nvidia.com. Retrieved 5 September 2023.
71. "NVIDIA Tensor Core GPU" (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf) (PDF). nvidia.com. Retrieved 5 September 2023.
72. "NVIDIA Hopper Architecture In-Depth" (https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/). 22 March 2022.
73. shape x converted operand size, e.g. 2 tensor cores x 4x4x4xFP16/cycle = 256 Bytes/cycle
74. = product first 3 table rows
75. = product of previous 2 table rows; shape: e.g. 8x8x4xFP16 = 512 Bytes
76. Sun, Wei; Li, Ang; Geng, Tong; Stuijk, Sander; Corporaal, Henk (2023). "Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors". IEEE Transactions on Parallel and Distributed Systems. 34 (1): 246–261. arXiv:2206.02874 (https://arxiv.org/abs/2206.02874). doi:10.1109/tpds.2022.3217824 (https://doi.org/10.1109%2Ftpds.2022.3217824). S2CID 249431357 (https://api.semanticscholar.org/CorpusID:249431357).
77. "Parallel Thread Execution ISA Version 7.7" (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma).
78. Raihan, Md Aamir; Goli, Negar; Aamodt, Tor (2018). "Modeling Deep Learning Accelerator Enabled GPUs". arXiv:1811.08309 (https://arxiv.org/abs/1811.08309) [cs.MS (https://arxiv.org/archive/cs.MS)].
79. "NVIDIA Ada Lovelace Architecture" (https://www.nvidia.com/en-gb/geforce/ada-lovelace-architecture).
80. Jia, Zhe; Maggioni, Marco; Smith, Jeffrey; Daniele Paolo Scarpazza (2019). "Dissecting the NVidia Turing T4 GPU via Microbenchmarking". arXiv:1903.07486 (https://arxiv.org/abs/1903.07486) [cs.DC (https://arxiv.org/archive/cs.DC)].
81. Burgess, John (2019). "RTX ON – The NVIDIA TURING GPU" (https://ieeexplore.ieee.org/document/8875651). 2019 IEEE Hot Chips 31 Symposium (HCS). pp. 1–27. doi:10.1109/HOTCHIPS.2019.8875651 (https://doi.org/10.1109%2FHOTCHIPS.2019.8875651). ISBN 978-1-7281-2089-8. S2CID 204822166 (https://api.semanticscholar.org/CorpusID:204822166).
82. Burgess, John (2019). "RTX ON – The NVIDIA TURING GPU" (https://ieeexplore.ieee.org/document/8875651). 2019 IEEE Hot Chips 31 Symposium (HCS). pp. 1–27. doi:10.1109/HOTCHIPS.2019.8875651 (https://doi.org/10.1109%2FHOTCHIPS.2019.8875651). ISBN 978-1-7281-2089-8. S2CID 204822166 (https://api.semanticscholar.org/CorpusID:204822166).
83. dependent on device
84. "Tegra X1" (https://developer.nvidia.com/content/tegra-x1). 9 January 2015.
85. NVIDIA H100 Tensor Core GPU Architecture (https://nvdam.widen.net/s/5bx55xfnxf/gtc22-whitepaper-hopper)
86. H.1. Features and Technical Specifications – Table 14. Technical Specifications per Compute Capability (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications)
87. NVIDIA Hopper Architecture In-Depth (https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth)
88. can only execute 160 integer instructions according to programming guide
89. 128 according to [1] (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions). 64 from FP32 + 64 separate units?
90. 64 by FP32 cores and 64 by flexible FP32/INT cores.
91. "CUDA C++ Programming Guide" (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions).
92. 32 FP32 lanes combine to 16 FP64 lanes. Maybe lower depending on model.
93. only supported by 16 FP32 lanes, they combine to 4 FP64 lanes
94. depending on model
95. Effective speed, probably over FP32 ports. No description of actual FP64 cores.
96. Can also be used for integer additions and comparisons
97. 2 clock cycles/instruction for each SM partition Burgess, John (2019). "RTX ON – The NVIDIA TURING GPU" (https://ieeexplore.ieee.org/document/8875651). 2019 IEEE Hot Chips 31 Symposium (HCS). pp. 1–27. doi:10.1109/HOTCHIPS.2019.8875651 (https://doi.org/10.1109%2FHOTCHIPS.2019.8875651). ISBN 978-1-7281-2089-8. S2CID 204822166 (https://api.semanticscholar.org/CorpusID:204822166).
98. Durant, Luke; Giroux, Olivier; Harris, Mark; Stam, Nick (May 10, 2017). "Inside Volta: The World's Most Advanced Data Center GPU" (https://devblogs.nvidia.com/inside-volta/). Nvidia developer blog.
99. The schedulers and dispatchers have dedicated execution units unlike with Fermi and Kepler.
100. Dispatching can overlap concurrently, if it takes more than one cycle (when there are less execution units than 32/SM Partition)
101. Can dual issue MAD pipe and SFU pipe
102. No more than one scheduler can issue 2 instructions at once. The first scheduler is in charge of warps with odd IDs. The second scheduler is in charge of warps with even IDs.
103. shared memory only, no data cache
104. shared memory separate, but L1 includes texture cache
105. "H.6.1. Architecture" (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#architecture-7-x). docs.nvidia.com. Retrieved 2019-05-13.
106. "Demystifying GPU Microarchitecture through Microbenchmarking" (https://www.stuffedcow.net/files/gpuarch-ispass2010.pdf) (PDF).
107. Jia, Zhe; Maggioni, Marco; Staiger, Benjamin; Scarpazza, Daniele P. (2018). "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking". arXiv:1804.06826 (https://arxiv.org/abs/1804.06826) [cs.DC (https://arxiv.org/archive/cs.DC)].
108. Jia, Zhe; Maggioni, Marco; Smith, Jeffrey; Daniele Paolo Scarpazza (2019). "Dissecting the NVidia Turing T4 GPU via Microbenchmarking". arXiv:1903.07486 (https://arxiv.org/abs/1903.07486) [cs.DC (https://arxiv.org/archive/cs.DC)].
109. "Dissecting the Ampere GPU Architecture through Microbenchmarking" (https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s33322/).
110. Note that Jia, Zhe; Maggioni, Marco; Smith, Jeffrey; Daniele Paolo Scarpazza (2019). "Dissecting the NVidia Turing T4 GPU via Microbenchmarking". arXiv:1903.07486 (https://arxiv.org/abs/1903.07486) [cs.DC (https://arxiv.org/archive/cs.DC)]. disagrees and states 2 KiB L0 instruction cache per SM partition and 16 KiB L1 instruction cache per SM
111. "asfermi Opcode" (https://github.com/hyqneuron/asfermi/wiki/Opcode). GitHub.
112. for access with texture engine only
113. 25% disabled on RTX 4060, RTX 4070, RTX 4070 Ti and RTX 4090
114. 25% disabled on RTX 5070 Ti and RTX 5090
115. "I.7. Compute Capability 8.x" (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-8-x). docs.nvidia.com. Retrieved 2022-10-12.
116. "Appendix F. Features and Technical Specifications" (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf) (PDF). (3.2 MiB), page 148 of 175 (Version 5.0 October 2012).
117. "nVidia CUDA Bioinformatics: BarraCUDA" (https://www.biocentric.nl/biocentric/nvidia-cuda-bioinformatics-barracuda/). BioCentric. 2019-07-19. Retrieved 2019-10-15.
118. "Part V: Physics Simulation" (https://developer.nvidia.com/gpugems/gpugems3/part-v-physics-simulation). NVIDIA Developer. Retrieved 2020-09-11.
119. "oneAPI Programming Model" (https://www.oneapi.io/). oneAPI.io. Retrieved 2024-07-27.
120. "Specifications | oneAPI" (https://www.oneapi.io/spec/). oneAPI.io. Retrieved 2024-07-27.
121. "oneAPI Specification — oneAPI Specification 1.3-rev-1 documentation" (https://oneapi-spec.uxlfoundation.org/specifications/oneapi/v1.3-rev-1/). oneapi-spec.uxlfoundation.org. Retrieved 2024-07-27.
122. "Exclusive: Behind the plot to break Nvidia's grip on AI by targeting software" (https://www.reuters.com/technology/behind-plot-break-nvidias-grip-ai-by-targeting-software-2024-03-25/). Reuters. Retrieved 2024-04-05.
123. "Question: What does ROCm stand for? · Issue #1628 · RadeonOpenCompute/ROCm" (https://github.com/RadeonOpenCompute/ROCm/issues/1628). Github.com. Retrieved January 18, 2022.
Further reading
Buck, Ian; Foley, Tim; Horn, Daniel; Sugerman, Jeremy; Fatahalian, Kayvon; Houston, Mike; Hanrahan, Pat (2004-08-01). "Brook for GPUs: stream computing on graphics hardware" (https://dl.acm.org/doi/10.1145/1015706.1015800). ACM Transactions on Graphics. 23 (3): 777–786. doi:10.1145/1015706.1015800 (https://doi.org/10.1145%2F1015706.1015800). ISSN 0730-0301 (https://search.worldcat.org/issn/0730-0301).
Nickolls, John; Buck, Ian; Garland, Michael; Skadron, Kevin (2008-03-01). "Scalable Parallel Programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?" (https://dl.acm.org/doi/10.1145/1365490.1365500). Queue. 6 (2): 40–53. doi:10.1145/1365490.1365500 (https://doi.org/10.1145%2F1365490.1365500). ISSN 1542-7730 (https://search.worldcat.org/issn/1542-7730).
External links
Official website (https://developer.nvidia.com/cuda-zone)