SciPy 2009 PyCuda Tutorial
Nicolas Pinto (MIT) and Andreas Klöckner (Brown)
Thanks
SciPy2009 Organizers
Andreas Klöckner (!)
PyCuda contributors
Nvidia Corporation
Outline
1 Introduction
2 Programming GPUs
3 GPU Scripting
Outline
1 Introduction
GPU Computing: Overview
Architectures and Programming Models
2 Programming GPUs
3 GPU Scripting
Stream Processing?
Market Overview
Based on that:
Sony/Toshiba/IBM: Cell Broadband Engine
ATI: R580 and later
Nvidia: G80 and later
Intel: Larrabee
Nvidia GT200
1 GPU = 30 MPs
1 MP = 1 ID (1/4 clock) + 8 SP + 1 DP + 16 KiB Shared + 32 KiB Reg + HW Sched
Scalar cores
max 512 threads/MP
Dedicated RAM (140 GB/s)
PCIe2 Host DMA (6 GB/s)
Limited Caches
Intel Larrabee
Unreleased (2010?)
x86-64 + SSE + vector-complete 512-bit ISA (LRBni)
4x Hyperthreading
32 (?) cores per chip
Fiber/Strand software threads
Recursive Launches
Coherent Caches (w/ explicit control)
Performance?
Programming Models
Pixel Shaders? DirectX? OpenGL?
[Diagram: programming models on a spectrum from hardware-specific to abstract — PTX, Cell Assembly, and LRBni at the hardware-specific end; Cell C, CUDA, and Brook+ in between; OpenCL toward the abstract end.]
Architecture Comparison
[Comparison table; the recoverable column (Cell) reads:]
+ Multicore
+ Open Spec
- Hard: DMA scheduling, alignment, small local store
- HW availability ($)
- Memory bandwidth
Questions?
Outline
1 Introduction
2 Programming GPUs
Overview
Dealing with Memory
3 GPU Scripting
What is CUDA?
CUDA ("Compute Unified Device Architecture") is Nvidia's programming model for its GPUs: C with a few extensions, compiled by nvcc.
Gains / Losses
Gains:
+ Memory Bandwidth (140 GB/s vs. 12 GB/s)
+ Compute Bandwidth (Peak: 1 TF/s vs. 50 GF/s; Real: 200 GF/s vs. 10 GF/s)
Multi-tiered Parallelism
[Figure: a Computational Grid of Blocks (3 x 2 shown), with one Block expanded into a 4 x 4 Block of Threads.]
Grid of Blocks → Block of Threads.
Only threads within a block can communicate.
Each Block is assigned to a physical execution unit.
Algorithm must work with blocks executed in any order.
Grids and Blocks replace outer loops in an algorithm.
Indices available at run time.
// GPU side
__global__ void square_array(float *a, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    a[i] = a[i] * a[i];
}
GPUs have (some) memory protection.
Typical errors:
Launch failure
Invalid sizes (block/grid/...)
Invisible Subtleties
Kernel Launches
Execution configuration:
dim3 grid_size(gx, gy); // max 2D
dim3 block_size(bx, by, bz); // max 3D
kernel<<<grid_size, block_size>>>(arg, ...);
Kernel launches do not wait for completion.
Cheap! (~2 µs overhead)
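PyCuda is not introduced until part 3, but for reference, the same execution configuration there becomes keyword arguments on the kernel handle. A minimal sketch, assuming a compiled kernel func and a device array a_gpu (both obtained as shown later):

import numpy
n = 1 << 20
block = (256, 1, 1)                      # threads per block
grid = ((n + 255) // 256, 1)             # enough blocks to cover n elements
func(a_gpu, numpy.int32(n), grid=grid, block=block)  # returns immediately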
[Diagram: the CUDA compiler driver (nvcc) automatically splits a .cu source — ptxas produces the GPU binary (.cubin), gcc/MSVC the CPU binary (.o) — and merges both into a fat binary (.o/executable). At run time the fat binary reaches the GPU through the Driver/Debugger/Profiler and the CUDA Runtime Interface, and the CPU directly.]
Hands-on Exercise
1 Edit 1-cuda-simple/simple.cu:
cudaSetDevice(Your GPU # );
2 Compile and run:
nvcc -o simple.x simple.cu
./simple.x
3 Add error checking to the example.
4 Modify simple.cu to print the contents of the result.
5 Modify simple.cu to compute c_i = a_i * b_i.
6 Modify simple.cu to use blocks of 16 x 16 threads.
Preliminary bits:

#include <stdio.h>

#define CUDA_CHK(NAME, ARGS) { \
  cudaError_t cuda_err_code = NAME ARGS; \
  if (cuda_err_code != cudaSuccess) { \
    printf("%s failed with code %d\n", #NAME, cuda_err_code); \
    abort(); \
  } \
}
Allocating memory:

int main()
{
  cudaSetDevice(0); // EDIT ME

  const int n = 4096;

  float *a_host = (float *) malloc(n*sizeof(float));
  float *b_host = (float *) malloc(n*sizeof(float));

  float *a_device, *b_device;
  CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
  CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
Questions?
Memory Model
[Figure: the CUDA memory hierarchy — per-thread registers and local memory, per-block shared memory, and device-wide Global, Constant, and Texture memory.]
Registers
32 KiB of registers per MP
Per-thread
Latency: 1 clock
Variable amount per thread
Register count limits max. threads/MP
(CPUs: fixed register file)
Global Memory
[Figure: the memory-hierarchy diagram again, highlighting Global memory.]
Measuring Performance
[Plot: memory bandwidth (GB/s) vs. matrix size (bytes) for the naive transpose — roughly 1.5 to 3 GB/s over 10^6 to 10^8 bytes.]
Fantastic! About same as CPU. Why?
Texture Memory

#define PREPARE \
  cudaBindTexture(0, in_tex, d_idata, mem_size); \
  std::swap(grid.x, grid.y); \
  std::swap(threads.x, threads.y);

[Plot: memory bandwidth (GB/s) vs. matrix size (bytes) — the texture version reaches roughly 20 GB/s, far above the naive kernel.]

Better! But texture units can't quite hide the wide data bus.
Need a different idea.
Shared Memory
16 KiB of shared mem per MP
Per-block
Latency: 2 clocks
Variable amount per block
Shared memory limits max. blocks/MP
Banked
Transpose: Idea
Don't transpose element-by-element.
Transpose block-by-block instead (a kernel sketch follows the plot below).
[Plot: memory bandwidth (GB/s) vs. matrix size (bytes) for Naive, Textures, and Shared — the shared-memory version tops out around 30 to 35 GB/s.]
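A minimal PyCuda sketch of that block-by-block idea (my reconstruction, not the tutorial's benchmark code): each 16 x 16 tile is staged through shared memory so that both the global read and the global write are coalesced. The kernel name and the fixed 1024 x 1024 size are illustrative.

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
#define BLOCK 16
__global__ void transpose(float *out, const float *in, int w, int h)
{
    // +1 padding avoids shared-memory bank conflicts on the strided read
    __shared__ float tile[BLOCK][BLOCK+1];

    int x = blockIdx.x * BLOCK + threadIdx.x;
    int y = blockIdx.y * BLOCK + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];  // coalesced read

    __syncthreads();

    // swap the block coordinates, keep the thread coordinates
    x = blockIdx.y * BLOCK + threadIdx.x;
    y = blockIdx.x * BLOCK + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
""")

transpose = mod.get_function("transpose")
a = np.random.randn(1024, 1024).astype(np.float32)
a_t = np.empty_like(a)
transpose(cuda.Out(a_t), cuda.In(a), np.int32(1024), np.int32(1024),
          block=(16, 16, 1), grid=(1024 // 16, 1024 // 16))
assert np.allclose(a_t, a.T)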
Important
Don't choose just one type of memory.
Successful algorithms combine many types' strengths.
Questions?
Outline
1 Introduction
2 Programming GPUs
3 GPU Scripting
Scripting + GPUs: A good combination
Whetting your Appetite
Working with PyCuda
A peek under the hood
Metaprogramming CUDA
Scripting Languages
Python:
is discoverable and interactive.
has comprehensive built-in functionality.
manages resources automatically.
uses run-time typing.
works well for gluing lower-level blocks together.
Scripting: Goals
Scripting: Speed
Questions?
import pycuda.driver as cuda
import pycuda.autoinit
import numpy

# move a 4x4 array onto the GPU
a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

mod = cuda.SourceModule("""
    __global__ void doublify(float *a)   // Compute kernel
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a
import numpy
import pycuda.autoinit
from pycuda import gpuarray

a_cpu = numpy.random.randn(4,4).astype(numpy.float32)
b_cpu = numpy.random.randn(4,4).astype(numpy.float32)
c_cpu = a_cpu * b_cpu

a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
c_gpu = (a_gpu * b_gpu).get()

print c_cpu - c_gpu
Remember me?

// trivia
#include <stdio.h>

#define CUDA_CHK(NAME, ARGS) { \
  cudaError_t cuda_err_code = NAME ARGS; \
  if (cuda_err_code != cudaSuccess) { \
    printf("%s failed with code %d\n", #NAME, cuda_err_code); \
    abort(); \
  } \
}
// end

// kernel
__global__ void square_array(float *a, float *b, int n)
{
  int i = (blockIdx.x * blockDim.y + threadIdx.y)
      * blockDim.x + threadIdx.x;
  if (i < n)
    a[i] = a[i] * b[i];
}
// end

// main1
int main()
{
  cudaSetDevice(0); // EDIT ME

  const int n = 4096;

  float *a_host = (float *) malloc(n*sizeof(float));
  float *b_host = (float *) malloc(n*sizeof(float));

  float *a_device, *b_device;
  CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
  CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
  // end

  // main2
  for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

  CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float),
    cudaMemcpyHostToDevice));
  CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float),
    cudaMemcpyHostToDevice));

  dim3 block_dim(16, 16);
  int block_size = block_dim.x*block_dim.y;
  int n_blocks = (n + block_size-1) / block_size;
  square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);
  // end

  // main3
  CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float),
    cudaMemcpyDeviceToHost));

  for (int i = 0; i < n; i++)
    printf("%.0f ", a_host[i]);
  puts("\n");

  free(a_host);
  free(b_host);
  CUDA_CHK(cudaFree, (a_device));
  CUDA_CHK(cudaFree, (b_device));
}
// end
PyCuda Philosophy
PyCuda: Completeness
For example (a device-query sketch follows this list):
Arrays and Textures
Pagelocked host memory
Memory transfers (asynchronous, structured)
Streams and Events
Device queries
(GL Interop)
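As a quick taste of that coverage — a minimal sketch of driver-level device queries (picking device 0 is an assumption):

import pycuda.driver as cuda

cuda.init()
dev = cuda.Device(0)                       # first GPU in the system
print dev.name(), dev.compute_capability()
print dev.total_memory() // (1024 * 1024), "MB"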
PyCuda: Completeness
Linux
Windows
OS X
PyCuda: Workflow
Edit → Run → SourceModule("...") → PyCuda checks its compiler cache → on a miss, nvcc compiles the source to a .cubin and fills the cache → the binary is uploaded to the GPU and run.
mod = pycuda.driver.SourceModule(
    "__global__ void my_func(float *out, float *in){...}")
func = mod.get_function("my_func")

src = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.empty_like(src)

func(cuda.Out(dest),
     cuda.In(src),
     block=(400,1,1))
Automatic Cleanup
pycuda.gpuarray:
Meant to look and feel just like numpy.
gpuarray.to_gpu(numpy_array)
numpy_array = gpuarray.get()
No: nd indexing, slicing, etc. (yet!)
Yes: +, -, *, /, fill, sin, exp, rand, take, ...
Random numbers using pycuda.curandom
Mixed types (int32 + float32 = float64)
print gpuarray for debugging.
Memory behind gpuarray available as the .gpudata attribute.
Use as kernel arguments, textures, etc. (a sketch follows below).
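A minimal sketch of that last point — passing a gpuarray's device memory to a hand-written kernel via .gpudata (the scale kernel is illustrative):

import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda import gpuarray
from pycuda.compiler import SourceModule

a_gpu = gpuarray.to_gpu(numpy.random.randn(400).astype(numpy.float32))

mod = SourceModule("""
__global__ void scale(float *a, float factor)
{ a[threadIdx.x] *= factor; }
""")
scale = mod.get_function("scale")

# .gpudata is the raw device pointer, usable wherever a kernel expects one
scale(a_gpu.gpudata, numpy.float32(3.0), block=(400, 1, 1))
print a_gpu  # gpuarrays print like numpy arrays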
http://mathema.tician.de/software/pycuda
X Consortium License
(no warranty, free for all use)
Requires: numpy, Boost C++,
Python 2.4+.
Support via mailing list.
Questions?
CUDA APIs
[Diagram: C/C++ and Python front ends sit above the Driver API, which talks to the Kernel Driver, which drives the Hardware.]
Questions?
Human vs Machine

Idea
In PyCuda, CUDA C code does not need to be a compile-time constant (unlike with the CUDA Runtime API).

[Diagram: the Human writes Python Code, which is easy to write and generates CUDA C Code; nvcc turns it into a .cubin, which the Machine runs on the GPU; the Result feeds back into the Python Code — all at run time.]
Automated Tuning
(cf. ATLAS, FFTW)
Data types
Specialize code for given problem
Constants faster than variables (→ lower register pressure)
Loop Unrolling
(a run-time code-generation sketch follows)
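A minimal sketch of the idea (the kernel and names are illustrative, not the tutorial's code): the loop bound is substituted into the source at run time, so nvcc sees a constant and can unroll.

import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

loop_count = 16  # a "problem size" known only at run time

source = """
__global__ void repeat_add(float *a)
{
    int idx = threadIdx.x;
    for (int k = 0; k < %(LOOP_COUNT)d; ++k)  // constant bound -> unrollable
        a[idx] += 1;
}
""" % {"LOOP_COUNT": loop_count}

repeat_add = SourceModule(source).get_function("repeat_add")

a = numpy.zeros(128, dtype=numpy.float32)
repeat_add(cuda.InOut(a), block=(128, 1, 1))
assert (a == loop_count).all()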
Questions?
Goal
Instructions
1 cd 3-pycuda-matrixmul-simple
2 Edit matrixmul_simple.py
3 Complete the TODOs.
Code
Initialization:
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import atexit
Code
Memory allocation and transfer:
# define the (square) matrix size
# note that we'll only use one block of threads here
# as a consequence this number (squared) can't exceed max_threads,
# see http://documen.tician.de/pycuda/util.html#pycuda.tools.DeviceData
# for more information on how to get this number for your device
MATRIX_SIZE = 2
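A quick way to look that limit up (a sketch; assumes an active context, e.g. via pycuda.autoinit):

from pycuda.tools import DeviceData
print DeviceData().max_threads  # e.g. 512 on GT200-class hardware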
Code
GPU code compilation and execution:
# get the kernel code from the template
# by specifying the constant MATRIX_SIZE
kernel_code = kernel_code_template % {
    'MATRIX_SIZE': MATRIX_SIZE
    }
Code
GPU kernel code:
kernel_code_template = """..."""
Code
GPU kernel code (solution):
kernel_code_template = """..."""
Goal
Code
GPU kernel code:
kernel_code_template = """
// Block index
const uint bx = blockIdx.x;
const uint by = blockIdx.y;
// Thread index
const uint tx = threadIdx.x;
const uint ty = threadIdx.y;
Code
GPU kernel code (cont'd):
// The element of the block submatrix that is computed
// by the thread
float Csub = 0;
// Loop over all the submatrices of A and B required to
// compute the block submatrix
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep)
{
    // Shared memory for the submatrix of A
    __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
    // Shared memory for the submatrix of B
    __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
Code
GPU kernel code (cont'd):
// Multiply the two matrices together;
// each thread computes one element
// of the block submatrix
for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
    Csub += As[ty][k] * Bs[k][tx];
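For completeness, a sketch assembling these fragments into a full template. The aBegin/aEnd/aStep/bBegin/bStep setup and the final write-back are my reconstruction, following the standard CUDA tiled matrix-multiply example; the tutorial's own solution may differ.

kernel_code_template = """
__global__ void matrixmul(float *C, float *A, float *B)
{
    const uint wA = %(MATRIX_SIZE)s;   // width of A (and height of B)
    const uint wB = %(MATRIX_SIZE)s;   // width of B

    const uint bx = blockIdx.x, by = blockIdx.y;
    const uint tx = threadIdx.x, ty = threadIdx.y;

    // first, last, and step of the submatrices of A for this block
    const uint aBegin = wA * %(BLOCK_SIZE)s * by;
    const uint aEnd = aBegin + wA - 1;
    const uint aStep = %(BLOCK_SIZE)s;

    // first and step of the submatrices of B for this block
    const uint bBegin = %(BLOCK_SIZE)s * bx;
    const uint bStep = %(BLOCK_SIZE)s * wB;

    float Csub = 0;
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
        __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
        __syncthreads();

        for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
            Csub += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }

    // write the block submatrix back to global memory
    const uint c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
    C[c + wB * ty + tx] = Csub;
}
"""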
Goal
Multiply two matrices together (any size).
Use global memory and shared memory.
Implement various optimizations:
different granularities of parallelism (block and work sizes),
loop unrolling,
register pressure (spilling),
pre-fetching (global memory load).
Parameterize the code using a template engine (Cheetah).
Auto-tune depending on the hardware and the input data.
Instructions
1 cd 5-pycuda-matrixmul-opt
2 Implement your auto-tuning function.
3 Use PyCuda to gather information (registers, occupancy).
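A minimal sketch of such a function. The names kernel_code_template, matrixmul, MATRIX_SIZE, and the a_gpu/b_gpu/c_gpu arrays are assumed to come from the exercise file; timing uses CUDA events, and func.num_regs reads the register count.

import pycuda.driver as cuda
from pycuda.compiler import SourceModule

def benchmark(block_size):
    # build a kernel specialized for this tile size
    # (block_size must divide MATRIX_SIZE)
    source = kernel_code_template % {"BLOCK_SIZE": block_size}
    func = SourceModule(source).get_function("matrixmul")
    print "block=%d registers=%d" % (block_size, func.num_regs)

    start, stop = cuda.Event(), cuda.Event()
    start.record()
    func(c_gpu, a_gpu, b_gpu,
         block=(block_size, block_size, 1),
         grid=(MATRIX_SIZE // block_size, MATRIX_SIZE // block_size))
    stop.record()
    stop.synchronize()
    return stop.time_since(start)  # milliseconds

best = min((benchmark(bs), bs) for bs in [4, 8, 16])
print "best: block=%d (%.3f ms)" % (best[1], best[0])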
Code
Some numbers
Conclusions
Thank you