Introduction to Accelerators
Sreepathi Pai
October 11, 2017
URCS
Outline
Introduction to Accelerators
GPU Architectures
Introduction to Accelerators
• Single-core processors
• Multi-core processors
• What if these aren’t enough?
• Accelerators, specifically GPUs
• what they are
• when you should use them
Timeline
• 1980s
• Geometry Engines
• 1990s
• Consumer GPUs
• Out-of-order Superscalars
• 2000s
• General-purpose GPUs
• Multicore CPUs
• Cell BE (Playstation 3)
• Lots of specialized accelerators in phones
The Graphics Processing Unit (1980s)
GPU Architectures
• 2-wide, in-order
• 4-wide SMT
• 2048 threads per core (64 warps)
• 15 cores
• Each thread runs the same code (hence SIMT: Single Instruction, Multiple Threads)
• 65,536 32-bit registers (256 KB)
• A thread can use up to 255 of these
• Partitioned among threads (not shared!)
• 192 ALUs
• 64 double-precision units
• 32 load/store units
• 32 Special Function Units (SFUs)
• 64 KB L1/shared cache
• The shared portion is a software-managed cache (see the sketch after this list)
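To make the SIMT model concrete, here is a minimal CUDA sketch (not from the slides; the kernel name and block size are illustrative): every thread executes the same kernel body on its own index, and the __shared__ array is one use of the software-managed cache described above.

__global__ void scale(float *out, const float *in, float f) {
    // assumes out/in hold at least gridDim.x * blockDim.x elements
    __shared__ float tile[256];      // carved out of the 64 KB L1/shared space
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];       // stage data through the software-managed cache
    __syncthreads();                 // all warps in the block synchronize here
    out[i] = tile[threadIdx.x] * f;  // every thread runs this same code (SIMT)
}

A launch such as scale<<<n/256, 256>>>(out, in, 2.0f) runs one thread per element, with 256 threads (8 warps) per block.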
CPU vs GPU
• Image Processing
• Graphics Rendering
• Matrix Multiply
• FFT
See “Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU” by V. W. Lee et al. for more examples and a comparison of CPU and GPU.
Libraries
• NVIDIA Thrust
• Like the C++ STL, but executes on the GPU (see the sketch after this list)
• Modern GPU
• At first glance: high-performance library routines for sorting, searching, reductions, etc.
• A deeper look: specific “hard” problems tackled in a different style
• NVIDIA CUB
• Low-level primitives for use in CUDA kernels
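As a concrete illustration of the Thrust bullet above, a minimal sketch patterned on Thrust's canonical sort example (the vector size is arbitrary):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>

int main(void) {
    thrust::host_vector<int> h(1 << 20);          // 1M ints on the host
    std::generate(h.begin(), h.end(), rand);      // fill with random values
    thrust::device_vector<int> d = h;             // copy host -> GPU
    thrust::sort(d.begin(), d.end());             // STL-style call; runs as GPU kernels
    thrust::copy(d.begin(), d.end(), h.begin());  // copy sorted result back
    return 0;
}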
Directive-Driven Programming
• OpenACC, a new standard for “offloading” parallel work to an accelerator
• Currently supported only by the PGI Accelerator compiler
• gcc 5.0 support is ongoing
• OpenMPC, a research compiler, can compile OpenMP code + extra directives to CUDA
• OpenMP 4.0 also supports offload to accelerators
• Not for GPUs yet
int main(void) {
  double pi = 0.0f; long i;
  /* reduction loop reconstructed from the standard OpenACC pi example */
  #pragma acc parallel loop reduction(+:pi)
  for (i = 0; i < N; i++) {
    double t = (double)((i + 0.5) / N);
    pi += 4.0 / (1.0 + t * t);
  }
  printf("pi=%16.15f\n", pi/N);
  return 0;
}
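This follows the classic OpenACC π-by-quadrature example; the definition of N and the #include <stdio.h> are assumed to appear elsewhere on the original slide. With the PGI compiler mentioned above, it would be built with something like pgcc -acc pi.c, where -acc enables processing of the acc directives.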
Python-based Tools (pyCUDA)
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
print(dest - a * b)
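For reference, drv.In and drv.Out wrap the numpy arrays so that pyCUDA copies them to the GPU before the call and back afterwards, and block=(400,1,1) launches one thread per element of the 400-element arrays.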
OpenCL
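• The open, vendor-neutral counterpart to CUDA: kernels are written in a C-based language and compiled at runtime, so the same code can target GPUs and other accelerators from multiple vendors.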