

Programming GPUs with PyCuda

Nicolas Pinto (MIT) and Andreas Klöckner (Brown)

SciPy Conference 2009 / Advanced Tutorial


http://conference.scipy.org/advanced_tutorials

August 19, 2009


Thanks

SciPy2009 Organizers
Andreas Klöckner (!)
PyCuda contributors
Nvidia Corporation


Outline

1 Introduction
   GPU Computing: Overview
   Architectures and Programming Models

2 Programming GPUs
   Overview
   Dealing with Memory

3 GPU Scripting
   Scripting + GPUs: A good combination
   Whetting your Appetite
   Working with PyCuda
   A peek under the hood
   Metaprogramming CUDA

4 PyCuda Hands-on: Matrix Multiplication
   Simple
   Tiled
   Meta-programming / Auto-tuning

Stream Processing?

Design target for CPUs:


Focus on Task parallelism
Make a single thread very fast
Hide latency through large caches
Predict, speculate
Stream Processing takes a different approach:

Focus on Data parallelism
Throughput matters, single threads do not
Hide latency through parallelism
Let programmer deal with raw memory hierarchy

CPU Chip Real Estate

[Figure: die floorplan of the Intel Core i7 (2008). 45 nm, 4x4 SP ops at a time, 4x256 KB L2, 8 MB L3.]
GPU Chip Real Estate

[Figure: die floorplan of the AMD RV770 (2008). 55 nm, 800 SP ops at a time.]
Market Overview

Quote Linus Torvalds:

"Hardware that isn't mass market tends to not be worth it in the long run."

Based on that:
Sony/Toshiba/IBM: Cell Broadband Engine
ATI: R580 and later
Nvidia: G80 and later
Intel: Larrabee

Cell BE: Architecture

1 Cell BE = 1 dual-core Power + 8 SPEs + Bus
1 SPE = SPU + DMA + 256 KiB Local Store
1 SPU = 128-bit Vector ALU
Bus = 200 GB/s Ring
Ded. RAM (25 GB/s)

GPU: Architecture (e.g. Nvidia)

1 GPU = 30 MPs
1 MP = 1 ID (1/4 clock) + 8 SP + 1 DP + 16 KiB Shared + 32 KiB Reg + HW Sched
Scalar cores
max 512 threads/MP
Ded. RAM (140 GB/s)
PCIe2 Host DMA (6 GB/s)
Limited Caches

Intel Larrabee: Architecture

Unreleased (2010?)
x86-64 + SSE + vector-complete 512-bit ISA (LRBni)
4x Hyperthreading
32 (?) cores per chip
Fiber/Strand software threads
Recursive Launches
Coherent Caches (w/ explicit control)
Performance?

Programming Models

Pixel Shaders?

DirectX?

OpenGL?

Dedicated Compute APIs


Not much "graphics-y" stuff visible

Programming Models

The models span a spectrum from hardware-specific to abstract:

Hardware-specific: PTX, Cell Assembly, LRBni
Abstract: Cell C, Brook+, CUDA, OpenCL

Architecture Comparison

Cell:
+ Multicore
+ Open Spec
- Hard: DMA sched, Alignment, Small LS
- HW Avail. ($)
- Mem BW

GPU:
+ Available now
+ OpenCL
+ Maturing (tools etc.)
+ Mem BW
o Reasonably easy
- Perf. opaque

Larrabee:
- Unavailable
+ Programmability?
o OpenCL Support (→ why wait?)
- Competitive with next-gen GPU? (→ why specialize?)

Focus here: the GPU.

Questions?

What is CUDA?

CUDA is Nvidia's proprietary compute abstraction.


Main merit: A well-balanced model of GPU computing.
Abstract enough to not be hardware-specific.
Concrete enough to expose most hardware features.
(Very) close semantic relative of OpenCL.

Gains and Losses

Gains:
+ Memory Bandwidth (140 GB/s vs. 12 GB/s)
+ Compute Bandwidth (Peak: 1 TF/s vs. 50 GF/s, Real: 200 GF/s vs. 10 GF/s)

Losses:
- Recursion
- Function pointers
- Exceptions
- IEEE 754 FP compliance
- Cheap branches (i.e. ifs)

GPUs: Threading Model

Multi-tiered Parallelism:
Grid of Blocks
Block of Threads

Only threads within a block can communicate
Each Block is assigned to a physical execution unit.
Algorithm must work with blocks executed in any order
Grids and Blocks replace outer loops in an algorithm.
Indices available at run time

[Figure: the Computational Grid, a 3x2 grid of Blocks (0,0) to (2,1); each Block, e.g. Block (1,0), is a 4x4 array of Threads.]

My first CUDA program

// GPU-side

__global__ void square_array(float *a, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    a[i] = a[i] * a[i];
}

My first CUDA program


int main() // CPU-side
{
  cudaSetDevice(0); // EDIT ME

  int n = 4096; int bytes = n*sizeof(float);
  float *a_host = (float *) malloc(bytes);
  for (int i = 0; i < n; i++) a_host[i] = i;

  float *a_device;
  cudaMalloc((void **) &a_device, bytes);
  cudaMemcpy(a_device, a_host, bytes, cudaMemcpyHostToDevice);

  int block_size = 256;
  int n_blocks = (n + block_size-1) / block_size;
  square_array<<<n_blocks, block_size>>>(a_device, n);

  free(a_host); cudaFree(a_device);
}

A Dose of Reality: Error Checking

#define CUDA_CHK(NAME, ARGS) { \
  cudaError_t cuda_err_code = NAME ARGS; \
  if (cuda_err_code != cudaSuccess) { \
    printf("%s failed with code %d\n", #NAME, cuda_err_code); \
    abort(); \
  } \
}

CUDA_CHK(cudaMalloc, (&result, m_size*sizeof(float)));

Typical errors:
GPUs have (some) memory protection → "Launch failure"
Invalid sizes (block/grid/...)

Invisible Subtleties

Host Pointer or Device Pointer?

float *h_data = (float *) malloc(mem_size);
float *d_data;
CUDA_CHK(cudaMalloc, ((void **) &d_data, mem_size));

Both kinds of pointer share the same data type!

Kernel Launches
Execution configuration:
dim3 grid_size(gx, gy); // max 2D
dim3 block_size(bx, by, bz); // max 3D
kernel<<<grid_size, block_size>>>(arg, ...);
Do not wait for completion.
Cheap! (~2 µs overhead)

The CUDA Toolchain

.cu (GPU+CPU code)
  → CUDA compiler driver (nvcc), automatic:
      ptxas → GPU Binary (.cubin)
      gcc/MSVC → CPU Binary (.o)
  → Fat Binary (.o/executable)

Executing CUDA Binaries

Fat Binary (.o/executable)
  → Driver (Debugger / Profiler can attach here) → GPU
  → CUDA Runtime Interface → CPU

GPU Demo Machines


Machine GPUs
iapcuda-01 Device 0: GeForce GTX 285
iapcuda-01 Device 1: Tesla C1060
iapcuda-01 Device 2: Tesla C1060
iapcuda-02 Device 0: GeForce GTX 295
iapcuda-02 Device 1: GeForce GTX 295
iapcuda-02 Device 2: Tesla C1060
iapcuda-02 Device 3: Tesla C1060

Prepare your workspace in one of our CUDA demo machine:


1 ssh [email protected]
(password: GpUh4cK3r)
2 mkdir lastname.firstname
3 cd lastname.firstname
4 wget is.gd/2o40o && tar xzf scipy09-pycuda-tut.tar.gz
Getting your feet wet

Hands-on Exercise
1 Edit 1-cuda-simple/simple.cu:
  cudaSetDevice(<your GPU #>);
2 Compile and run:
  nvcc -o simple.x simple.cu
  ./simple.x
3 Add error checking to the example.
4 Modify simple.cu to print the contents of the result.
5 Modify simple.cu to compute c_i = a_i * b_i.
6 Modify simple.cu to use blocks of 16 x 16 threads.

My first CUDA program (Solution)

Preliminary bits:

#include <stdio.h>

#define CUDA_CHK(NAME, ARGS) { \
  cudaError_t cuda_err_code = NAME ARGS; \
  if (cuda_err_code != cudaSuccess) { \
    printf("%s failed with code %d\n", #NAME, cuda_err_code); \
    abort(); \
  } \
}

My first CUDA program (Solution)

The GPU kernel:

__global__ void square_array(float *a, float *b, int n)
{
  int i = (blockIdx.x * blockDim.y + threadIdx.y)
          * blockDim.x + threadIdx.x;
  if (i < n)
    a[i] = a[i] * b[i];
}

My first CUDA program (Solution)

Allocating memory:

int main()
{
  cudaSetDevice(0); // EDIT ME

  const int n = 4096;

  float *a_host = (float *) malloc(n*sizeof(float));
  float *b_host = (float *) malloc(n*sizeof(float));

  float *a_device, *b_device;
  CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
  CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));

My first CUDA program (Solution)

Transfer and Launch:

  for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

  CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float),
      cudaMemcpyHostToDevice));
  CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float),
      cudaMemcpyHostToDevice));

  dim3 block_dim(16, 16);
  int block_size = block_dim.x*block_dim.y;
  int n_blocks = (n + block_size-1) / block_size;
  square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);

My first CUDA program (Solution)

Output and Clean-up:

  CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float),
      cudaMemcpyDeviceToHost));

  for (int i = 0; i < n; i++)
    printf("%.0f ", a_host[i]);
  puts("\n");

  free(a_host);
  CUDA_CHK(cudaFree, (a_device));
}

Questions?

Memory Model

Already seen:
Registers
Global

[Figure: each Block has its own Shared Memory; each Thread has Registers and Local memory; Global, Constant, and Texture memory are shared by the whole grid.]

Registers

32 KiB of registers per MP
Per-thread
Latency: 1 clock
Variable amount per thread
Register count limits max. threads/MP
(CPUs: fixed register file)

Global Memory

Several GiB usually
Per-GPU
Latency: 1000 clocks
512-bit memory bus
Best throughput: 16 consecutive threads read aligned chunk

Example: Matrix Transpose

[Figure: a 4x4 matrix (entries c_{1,1} through c_{4,4}) and its transpose: entry c_{i,j} moves to position (j,i).]

Naive: Using global memory

First attempt: Naive port of the CPU code.

__global__ void transpose(float *out, float *in, int w, int h) {
  unsigned int xIdx = blockDim.x * blockIdx.x + threadIdx.x;
  unsigned int yIdx = blockDim.y * blockIdx.y + threadIdx.y;

  if (xIdx < w && yIdx < h) {
    unsigned int idx_in = xIdx + w * yIdx;
    unsigned int idx_out = yIdx + h * xIdx;

    out[idx_out] = in[idx_in];
  }
}

Measuring Performance

Writing high-performance Codes


Mindset: What is going to be the limiting factor?
Floating point throughput?
Memory bandwidth?
Cache sizes?

Benchmark the assumed limiting factor right away.


Evaluate
Know your peak throughputs (roughly)
Are you getting close?
Are you tracking the right limiting factor?
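To make "benchmark the assumed limiting factor right away" concrete, here is a minimal sketch that times a bandwidth-bound operation with CUDA events and reports effective GB/s. It uses PyCuda (introduced in the next section); the array size is an arbitrary choice.

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

n = 1 << 22                                # 4M floats = 16 MB
a_gpu = gpuarray.to_gpu(np.random.randn(n).astype(np.float32))

start, stop = cuda.Event(), cuda.Event()
start.record()
b_gpu = 2*a_gpu                            # reads n floats, writes n floats
stop.record()
stop.synchronize()

elapsed_s = stop.time_since(start)*1e-3    # time_since() returns milliseconds
bytes_moved = 2*n*4                        # one read + one write per element
print "%.1f GB/s" % (bytes_moved/elapsed_s/1e9)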

Performance: Matrix transpose

Very likely: Bound by memory bandwidth.

[Plot: memory bandwidth [GB/s] vs. matrix size [bytes] for the naive kernel; it stays in the 0-3 GB/s range across 10^6 to 10^8 bytes.]

Fantastic! About same as CPU. Why?
Naive: Using global memory

__global__ void transpose(float *out, float *in, int w, int h) {
  unsigned int xIdx = blockDim.x * blockIdx.x + threadIdx.x;
  unsigned int yIdx = blockDim.y * blockIdx.y + threadIdx.y;

  if (xIdx < w && yIdx < h) {
    unsigned int idx_in = xIdx + w * yIdx;
    unsigned int idx_out = yIdx + h * xIdx;

    out[idx_out] = in[idx_in];
  }
}

Reading from global mem: stride 1 → one memory transaction.
Writing to global mem: stride 16 → 16 memory transactions!

Texture Memory

Same memory as global
But: more access patterns achieve usable bandwidth
Optional: 2D and 3D indexing
Small, incoherent Cache (prefers nD-local access)
Read-only
Latency: 1000 clocks (despite cache!)
Optional: Linear Interpolation

Transpose with Textures


texture<float, 1, cudaReadModeElementType> in_tex;

__global__ void transpose(float *out, int w, int h) {
  unsigned int yIdx = blockDim.x * blockIdx.x + threadIdx.x;
  unsigned int xIdx = blockDim.y * blockIdx.y + threadIdx.y;

  if (xIdx < w && yIdx < h) {
    unsigned int idx_in = xIdx + w * yIdx;
    unsigned int idx_out = yIdx + h * xIdx;

    out[idx_out] = tex1Dfetch(in_tex, idx_in);
  }
}

#define PREPARE \
  cudaBindTexture(0, in_tex, d_idata, mem_size); \
  std::swap(grid.x, grid.y); \
  std::swap(threads.x, threads.y);

Performance: Transpose with Textures

[Plot: memory bandwidth [GB/s] vs. matrix size [bytes], Naive vs. Textures; the texture version reaches roughly 10-25 GB/s.]

Better! But texture units can't quite hide wide data bus.
Need different idea.
Shared Memory

16 KiB of shared mem per MP
Per-block
Latency: 2 clocks
Variable amount per block
Shared memory limits max. blocks/MP
Banked

Transpose: Idea

Global memory dislikes non-unit strides.
Shared memory doesn't mind.

Idea
Don't transpose element-by-element.
Transpose block-by-block instead.

1 Read untransposed block from global and write to shared
2 Read block transposed from shared and write to global

Illustration: Blockwise Transpose

[Figure: the matrix is split into 4x4 tiles C_{i,j}; in the result, tile C_{j,i}^T sits at position (i,j), i.e. tiles are both moved and internally transposed.]

Improved: Using shared memory

__global__ void transpose(float *out, float *in, int w, int h) {
  __shared__ float block[BLOCK_DIM*BLOCK_DIM];

  unsigned int xBlock = blockDim.x * blockIdx.x;
  unsigned int yBlock = blockDim.y * blockIdx.y;

  unsigned int xIndex = xBlock + threadIdx.x;
  unsigned int yIndex = yBlock + threadIdx.y;

  unsigned int index_out, index_transpose;

  if (xIndex < w && yIndex < h) {
    unsigned int index_in = w * yIndex + xIndex;
    unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;

    block[index_block] = in[index_in];

    index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
    index_out = h * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
  }
  __syncthreads();

  if (xIndex < w && yIndex < h) {
    out[index_out] = block[index_transpose];
  }
}

Performance: Transpose with Shared Memory

[Plot: memory bandwidth [GB/s] vs. matrix size [bytes], Naive vs. Textures vs. Shared; the shared-memory version reaches roughly 30 GB/s.]

Not bad! Are we done?


Review: Memory Model

Type       Per      Access   Latency
Registers  thread   R/W      1
Local      thread   R/W      1000
Shared     block    R/W      2
Global     grid     R/W      1000     (not cached)
Constant   grid     R/O      1-1000   (cached)
Texture    grid     R/O      1000     (spatially cached)

Important
Don't choose one type of memory.
Successful algorithms combine many types' strengths.

Questions?

Scripting Languages

Python:
is discoverable and interactive.
has comprehensive built-in functionality.
manages resources automatically.
uses run-time typing.
works well for gluing lower-level blocks together.

Scripting: Goals

Scripting languages aim to reduce the load on the programmer:


Reduce required knowledge
Encourage experimentation
Eliminate sources of error
Encourage abstraction wherever possible
Value programmer time over computer time

Think about the tools you use.


Use the right tool for the job.

Why do Scripting for GPUs?

GPUs are everything that scripting languages are not:
Highly parallel
Very architecture-sensitive
Built for maximum compute/memory throughput

→ They complement each other.

CPU: largely restricted to control tasks (~1000/sec)
→ Scripting is fast enough.

Realize a promise: Use Scripting...
from first prototype
to full-scale production code.

Scripting: Speed

Usual answer to the "Speed Question":
Hybrid ("mixed") Code.
Plays to the strengths of each language.
But: Introduces (some) complexity.

Observation: GPU code is already hybrid.

Consequence: No added complexity through hybrid code.

Questions?

Whetting your appetite

import pycuda.driver as cuda
import pycuda.autoinit
import numpy

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCuda distribution.]

Whetting your appetite

mod = cuda.SourceModule("""
    // Compute kernel
    __global__ void doublify(float *a)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a

Whetting your appetite, Part II

Did somebody say "Abstraction is good"?

Whetting your appetite, Part II

import numpy
import pycuda.autoinit
from pycuda import gpuarray

a_cpu = numpy.random.randn(4,4).astype(numpy.float32)
b_cpu = numpy.random.randn(4,4).astype(numpy.float32)
c_cpu = a_cpu * b_cpu

a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
c_gpu = (a_gpu * b_gpu).get()

print c_cpu - c_gpu

Remember me?

[The complete CUDA C solution from the hands-on section, reproduced in two dense columns: error-checking macro, kernel, allocation, transfer, launch, output, and cleanup. Some 60 lines of C for what the dozen lines of PyCuda above accomplish.]

PyCuda Philosophy

Provide complete access
Automatically manage resources
Provide abstractions
Allow interactive use
Check for and report errors automatically
Integrate tightly with numpy

PyCuda: Completeness

PyCuda exposes all of CUDA.

For example:
Arrays and Textures
Pagelocked host memory
Memory transfers (asynchronous, structured)
Streams and Events
Device queries
(GL Interop)
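As a small illustration of the device-query part, here is a sketch using the PyCuda driver API (output values are of course machine-dependent):

import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    print "Device %d: %s" % (i, dev.name())
    print "  compute capability: %d.%d" % dev.compute_capability()
    print "  total memory: %d KB" % (dev.total_memory() // 1024)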

PyCuda: Completeness

PyCuda supports every OS that CUDA supports.

Linux
Windows
OS X

PyCuda: Workflow

Edit → Run → SourceModule("...") → PyCuda: compiler cache hit? If not: nvcc → .cubin. Then: upload to GPU → run on GPU.

Kernel Invocation: Automatic Copies

mod = pycuda.driver.SourceModule(
    "__global__ void my_func(float *out, float *in){...}")
func = mod.get_function("my_func")

src = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.empty_like(src)

func(
    cuda.Out(dest),
    cuda.In(src),
    block=(400,1,1))

InOut exists, too.

Only for immediate invocation style.

Automatic Cleanup

Reachable objects (memory, streams, ...) are never destroyed.
Once unreachable, released at an unspecified future time.
Scarce resources (memory) can be explicitly freed. (obj.free())
Correctly deals with multiple contexts and dependencies.
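A minimal sketch of the explicit-free escape hatch (the allocation size is an arbitrary example):

import pycuda.autoinit
import pycuda.driver as cuda

buf = cuda.mem_alloc(1 << 20)   # 1 MB of device memory
# ... use buf as a kernel argument ...
buf.free()                      # release now rather than at collection time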

gpuarray: Simple Linear Algebra

pycuda.gpuarray:
Meant to look and feel just like numpy.
gpuarray.to_gpu(numpy_array)
numpy_array = gpuarray.get()
No: nd indexing, slicing, etc. (yet!)
Yes: +, -, *, /, fill, sin, exp, rand, take, ...
Random numbers using pycuda.curandom
Mixed types (int32 + float32 = float64)
print gpuarray for debugging.
Memory behind gpuarray available as .gpudata attribute.
Use as kernel arguments, textures, etc.
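A short sketch of the numpy look and feel (pycuda.cumath supplies the elementwise math functions; the shape is arbitrary):

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath

a_gpu = gpuarray.to_gpu(numpy.random.randn(200).astype(numpy.float32))

b_gpu = cumath.sin(a_gpu) + 2*a_gpu   # all work stays on the GPU
print b_gpu                           # printable for debugging
result = b_gpu.get()                  # back to a numpy array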

gpuarray: Elementwise expressions

Avoiding extra store-fetch cycles for elementwise math:

from pycuda.curandom import rand as curand
a_gpu = curand((50,))
b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5

PyCuda: Vital Information

http://mathema.tician.de/software/pycuda
X Consortium License
(no warranty, free for all use)
Requires: numpy, Boost C++,
Python 2.4+.
Support via mailing list.

Questions?

CUDA APIs

CUDA has two Programming Interfaces:

Runtime: high-level (libcudart.so, in the toolkit)
Driver: low-level (libcuda.so, comes with the GPU driver)
(mutually exclusive)

[Diagram: C/C++ talks to the Runtime API, which sits on top of the Driver API; Python/PyCuda talks to the Driver API directly; below them sit the Kernel Driver and the Hardware.]

Runtime vs. Driver API

Runtime → Driver differences:

Explicit initialization.
Code objects ("Modules") become programming language objects.
Texture handling requires slightly more work.
Only needs nvcc for compiling GPU code.

Driver API:
Conceptually cleaner
Less sugar-coating (provide in Python)
Not very different otherwise
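What "explicit initialization" looks like in PyCuda when you skip pycuda.autoinit (a sketch; device 0 is an arbitrary choice):

import pycuda.driver as cuda

cuda.init()                 # explicit, driver-API style
dev = cuda.Device(0)        # pick a device
ctx = dev.make_context()    # create and activate a context
# ... work ...
ctx.pop()                   # deactivate the context before exiting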

PyCuda: API Tracing

With ./configure --cuda-trace=1, running the demo script from earlier prints every driver call as it happens:

cuInit
cuDeviceGetCount
cuDeviceGet
cuCtxCreate
cuMemAlloc
cuMemcpyHtoD
cuCtxGetDevice
cuDeviceComputeCapability
cuModuleLoadData
cuModuleGetFunction
cuFuncSetBlockShape
cuParamSetv
cuParamSetSize
cuLaunchGrid
cuMemcpyDtoH
cuCtxPopCurrent
cuCtxPushCurrent
cuMemFree
cuCtxPopCurrent
cuCtxPushCurrent
cuModuleUnload
cuCtxPopCurrent
cuCtxDestroy

Questions?

Human vs Machine

Idea

In PyCuda, CUDA C code does not need to be a compile-time constant.
(unlike the CUDA Runtime API)

[Diagram: the Human side writes Python Code, which is easy to write and generates CUDA C Code; the Machine side takes over from there: nvcc → .cubin → GPU → Result.]

Metaprogramming: machine-generated code

Why machine-generated code?

Automated Tuning (cf. ATLAS, FFTW)
Data types
Specialize code for given problem
Constants faster than variables (→ register pressure)
Loop Unrolling

[Diagram: machine-generated code keeps "Your Code" both Flexible and Fast.]
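As a tiny illustration of "specialize code for a given problem", the kernel source can be generated with plain Python string substitution so that a loop bound becomes a compile-time constant. This is a sketch; the kernel and its names are invented for the example:

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

n = 256  # known at code-generation time

# %(n)d becomes a literal constant in the generated source,
# so the compiler can unroll or strength-reduce the loop
mod = SourceModule("""
    __global__ void scale(float *a)
    {
        for (int k = 0; k < %(n)d; ++k)
            a[threadIdx.x] *= 1.0001f;
    }
    """ % {"n": n})

scale = mod.get_function("scale")
a_gpu = gpuarray.to_gpu(numpy.ones(64, numpy.float32))
scale(a_gpu, block=(64, 1, 1))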

PyCuda: Support for Metaprogramming

Access properties of compiled code:

func.{num_regs, shared_size_bytes, local_size_bytes}
Exact GPU timing via events
Can calculate hardware-dependent MP occupancy

codepy (by Andreas):
Build C syntax trees from Python
Generates readable, indented C
Or use a templating engine (many available, e.g. Cheetah)
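A sketch of reading those properties and estimating multiprocessor occupancy with pycuda.tools (it assumes func is a kernel obtained via get_function; the thread count of 256 is an example value):

from pycuda.tools import DeviceData, OccupancyRecord

print "registers/thread:", func.num_regs
print "shared mem/block:", func.shared_size_bytes, "bytes"

# OccupancyRecord(devdata, threads, shared memory, registers)
occ = OccupancyRecord(DeviceData(), 256,
                      func.shared_size_bytes, func.num_regs)
print "predicted MP occupancy: %.0f%%" % (occ.occupancy * 100)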

Questions?

Goal

Multiply two (small) square matrices together.
Use global memory only.
Use a single block of threads.
Each thread computes one element of the resulting matrix.

Instructions
1 cd 3-pycuda-matrixmul-simple
2 Edit matrixmul_simple.py
3 Complete the TODOs.

Code

Initialization:

import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import atexit

# initialize the device
# the following lines are equivalent to `import pycuda.autoinit`
# only if GPU_NUMBER == 0
GPU_NUMBER = 0 # TODO: change me
driver.init()
assert (driver.Device.count() >= 1)
dev = tools.get_default_device(GPU_NUMBER)
ctx = dev.make_context()
atexit.register(ctx.pop)

Code
Memory allocation and transfer:

# define the (square) matrix size
# note that we'll only use one block of threads here
# as a consequence this number (squared) can't exceed max_threads,
# see http://documen.tician.de/pycuda/util.html#pycuda.tools.DeviceData
# for more information on how to get this number for your device
MATRIX_SIZE = 2

# create two random square matrices
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)

# compute reference on the CPU to verify GPU computation
c_cpu = np.dot(a_cpu, b_cpu)

# transfer host (CPU) memory to device (GPU) memory
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)

# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)

Code
GPU code compilation and execution:
# get the kernel code from the template
# by specifying the constant MATRIX_SIZE
kernel_code = kernel_code_template % {
    'MATRIX_SIZE': MATRIX_SIZE
    }

# compile the kernel code
mod = compiler.SourceModule(kernel_code)

# get the kernel function from the compiled module
matrixmul = mod.get_function("MatrixMulKernel")

# call the kernel on the card
matrixmul(
    # inputs
    a_gpu, b_gpu,
    # output
    c_gpu,
    # (only one) block of MATRIX_SIZE x MATRIX_SIZE threads
    block=(MATRIX_SIZE, MATRIX_SIZE, 1),
    )
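
A quick check, not on the original slide, that the GPU result matches the CPU reference computed earlier:

# copy the result back and compare with the CPU reference
assert np.allclose(c_cpu, c_gpu.get(), rtol=1e-4)
print "GPU result matches the CPU reference"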


Code
GPU kernel code:
kernel_code_template = """
__global__ void MatrixMulKernel(float *a, float *b, float *c)
{
    // 2D thread ID (assuming that only one block will be executed)
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the result matrix
    // that is computed by the thread
    float Pvalue = 0;

    // each thread loads one row of a and one column of b
    // to produce one element of c
    for (int k = 0; k < %(MATRIX_SIZE)s; ++k) {
        float Aelement = a[...];  // TODO
        float Belement = b[...];  // TODO
        Pvalue += Aelement * Belement;
    }

    // write the result to device memory;
    // each thread writes one element
    c[...] = Pvalue;  // TODO
}
"""


Code
GPU kernel code (solution):
kernel_code_template = """
__global__ void MatrixMulKernel(float *a, float *b, float *c)
{
    // 2D thread ID (assuming that only one block will be executed)
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the result matrix
    // that is computed by the thread
    float Pvalue = 0;

    // each thread loads one row of a and one column of b
    // to produce one element of c
    for (int k = 0; k < %(MATRIX_SIZE)s; ++k) {
        float Aelement = a[ty * %(MATRIX_SIZE)s + k];
        float Belement = b[k * %(MATRIX_SIZE)s + tx];
        Pvalue += Aelement * Belement;
    }

    // write the result to device memory;
    // each thread writes one element
    c[ty * %(MATRIX_SIZE)s + tx] = Pvalue;
}
"""


Outline

1 Introduction

2 Programming GPUs

3 GPU Scripting

4 PyCuda Hands-on: Matrix Multiplication


Simple
Tiled
Meta-programming / Auto-tuning


Goal

Multiply two square matrices together.


Use global memory and shared memory.
Each thread block is assigned a tile of the resulting matrix
and is responsible for generating the elements in that tile.
Each thread in a block computes one element of the tile.


Code
GPU kernel code:
kernel_code_template = """
__global__ void MatrixMulKernel(float *A, float *B, float *C)
{
    const uint wA = %(MATRIX_SIZE)s;
    const uint wB = %(MATRIX_SIZE)s;

    // Block index
    const uint bx = blockIdx.x;
    const uint by = blockIdx.y;

    // Thread index
    const uint tx = threadIdx.x;
    const uint ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    const uint aBegin = wA * %(BLOCK_SIZE)s * by;
    // Index of the last sub-matrix of A processed by the block
    const uint aEnd = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    const uint aStep = %(BLOCK_SIZE)s;

    // Index of the first sub-matrix of B processed by the block
    const uint bBegin = %(BLOCK_SIZE)s * bx;
    // Step size used to iterate through the sub-matrices of B
    const uint bStep = %(BLOCK_SIZE)s * wB;


Code
GPU kernel code (cont'd):
    // Csub stores the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep)
    {
        // Shared memory for the sub-matrix of A
        __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
        // Shared memory for the sub-matrix of B
        __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
        // Synchronize to make sure the matrices are loaded
        __syncthreads();


Code
GPU kernel code (cont'd):
        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    const uint c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
    C[c + wB * ty + tx] = Csub;
}
"""
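
The deck shows only the kernel; a minimal host-side launch sketch (an assumption, not from the slides -- it presumes MATRIX_SIZE is a multiple of the tile size and that a_gpu, b_gpu, c_gpu were created at this size as before):

TILE_SIZE = 4        # the kernel's BLOCK_SIZE; tune per device
MATRIX_SIZE = 8      # must be a multiple of TILE_SIZE in this sketch

# substitute both constants into the template
kernel_code = kernel_code_template % {
    'MATRIX_SIZE': MATRIX_SIZE,
    'BLOCK_SIZE': TILE_SIZE,
    }
mod = compiler.SourceModule(kernel_code)
matrixmul = mod.get_function("MatrixMulKernel")

# one thread block per tile of the result matrix
matrixmul(
    a_gpu, b_gpu, c_gpu,
    block=(TILE_SIZE, TILE_SIZE, 1),
    grid=(MATRIX_SIZE / TILE_SIZE, MATRIX_SIZE / TILE_SIZE),
    )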


Outline

1 Introduction

2 Programming GPUs

3 GPU Scripting

4 PyCuda Hands-on: Matrix Multiplication


Simple
Tiled
Meta-programming / Auto-tuning


Goal
Multiply two matrices together (any size).
Use global memory and shared memory.
Implement various optimizations:
different granularities of parallelism (block and work sizes),
loop unrolling,
register pressure (spilling),
pre-fetching (global memory load).
Parametrize the code using a template engine (Cheetah).
Auto-tune depending on the hardware and the input data.

Instructions
1 cd 5-pycuda-matrixmul-opt
2 Implement your auto-tuning function.
3 Use PyCuda to gather information (registers, occupancy).


Code

Show the code ;-)
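
In lieu of the live demo, a minimal auto-tuning sketch in the spirit of the exercise (not the actual 5-pycuda-matrixmul-opt code; it reuses kernel_code_template, MATRIX_SIZE and the gpuarrays from the tiled example, and the candidate tile sizes are made up):

def time_kernel(func, args, block, grid, n_iter=5):
    # average runtime in ms, measured with CUDA events
    start, end = driver.Event(), driver.Event()
    start.record()
    for i in range(n_iter):
        func(*args, block=block, grid=grid)
    end.record()
    end.synchronize()
    return start.time_till(end) / n_iter

def autotune(tile_sizes=(4, 8, 16)):
    best = None
    for tile in tile_sizes:
        code = kernel_code_template % {
            'MATRIX_SIZE': MATRIX_SIZE, 'BLOCK_SIZE': tile}
        func = compiler.SourceModule(code).get_function("MatrixMulKernel")

        # use PyCuda to gather information (registers, occupancy)
        occ = tools.OccupancyRecord(tools.DeviceData(), threads=tile * tile,
                                    shared_mem=func.shared_size_bytes,
                                    registers=func.num_regs)
        grid = (MATRIX_SIZE / tile, MATRIX_SIZE / tile)
        t = time_kernel(func, (a_gpu, b_gpu, c_gpu),
                        (tile, tile, 1), grid)
        print "tile %2d: %.3f ms, %d regs, occupancy %.2f" % (
            tile, t, func.num_regs, occ.occupancy)
        if best is None or t < best[0]:
            best = (t, tile)
    return best[1]  # best-performing tile size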


Some numbers

3D Filterbank Convolutions used in our Visual Cortex Simulations: [performance figure not reproduced]

Conclusions

GPUs (or something like them) are here to stay.

First factor of 5-10 is usually easy to reach.
Next factor of 5-10 is a little bit harder.
Next factor of 5-10 is a lot harder:
requires deep understanding of the hardware architecture,
usually involves significant rethinking of the algorithm.
GPUs and scripting work surprisingly well together:
favorable balance between ease-of-use and raw performance,
enables (easy) metaprogramming.
Python / PyCuda rocks!


Thank you
