
CSC266 Introduction to Parallel Computing using GPUs
Introduction to Accelerators

Sreepathi Pai
October 11, 2017
URCS
Outline

Introduction to Accelerators

GPU Architectures

GPU Programming Models


Accelerators

• Single-core processors
• Multi-core processors
• What if these aren’t enough?
• Accelerators, specifically GPUs
• what they are
• when you should use them
Timeline

• 1980s
• Geometry Engines
• 1990s
• Consumer GPUs
• Out-of-order Superscalars
• 2000s
• General-purpose GPUs
• Multicore CPUs
• Cell BE (Playstation 3)
• Lots of specialized accelerators in phones
The Graphics Processing Unit (1980s)

• SGI Geometry Engine


• Implemented the Geometry Pipeline
• Hardwired logic
• Embarrassingly Parallel
• O(Pixels)
• Large number of logic elements
• High memory bandwidth
• From Kaufman et al. (2009)
GPU 2.0 (circa 2004)

• Like CPUs, GPUs benefited from Moore’s Law


• Evolved from fixed-function hardwired logic to flexible,
programmable ALUs
• Around 2004, GPUs were programmable “enough” to do some
non-graphics computations
• Severely limited by graphics programming model (shader
programming)
• In 2006, GPUs became “fully” programmable
• GPGPU: General-Purpose GPU
• NVIDIA releases “CUDA” language to write non-graphics
programs that will run on GPUs
Peak FLOP/s

(Figure from the NVIDIA CUDA C Programming Guide)


Memory Bandwidth

(Figure from the NVIDIA CUDA C Programming Guide)


GPGPU Today

• GPUs are widely deployed as accelerators
• Intel paper: the “10x vs 100x” myth (Lee et al., cited later)
• GPUs so successful that
other accelerators are dead
• Sony/IBM Cell BE
• ClearSpeed CSX
• Kepler K40 GPUs from NVIDIA have a peak performance of 4 TFLOP/s
• CM-5, the #1 system in 1993, was 60 GFLOP/s (Linpack)
• ASCI White (#1 in 2001) was 4.9 TFLOP/s (Linpack)

(Pictures of Titan and Tianhe-1A from the Top500 website.)
Accelerator Programming Models

• CPUs have always depended on co-processors


• I/O co-processors to handle slow I/O
• Math co-processors to speed up computation
• H.264 co-processor to play video (Phones)
• DSPs to handle audio (Phones)
• Many have been transparent
• Drop in the co-processor and everything sped up
• Or used a function-based model
• Call a function and it is sped up (e.g. “decode video”)
• The GPU is not a transparent accelerator for general purpose
computations
• Only graphics code is sped up transparently
• Code must be rewritten to target GPUs
Using a GPU

• You must retarget code for the GPU


• Rewrite, recompile, translate, etc.
Outline

Introduction to Accelerators

GPU Architectures

GPU Programming Models


The Two (Three?) Kinds of GPUs

• Type 1: Discrete GPUs


• More computational power
• More memory bandwidth
• Separate memory
(Image: NVIDIA)
The Two (Three?) Kinds of GPUs #2

• Type 2: Integrated GPUs


• Share memory with processor
• Share bandwidth with processor
• Consume less power
• Can participate in cache coherence

(Image: Intel)
The NVIDIA Kepler

(Figure from the NVIDIA Kepler GK110 Whitepaper)


Using a Discrete GPU

• You must retarget code for the GPU


• Rewrite, recompile, translate, etc.
• Working set must fit in GPU RAM
• You must copy data to/from GPU RAM (see the sketch after this list)
• “You”: Programmer, Compiler, Runtime, OS, etc.
• Some recent hardware can do this for you (it’s slow)
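
To make the copy-in/copy-out requirement concrete, here is a minimal host-side sketch using the CUDA Runtime API (the buffer names and the one-million-element working set are illustrative assumptions, not from the slides):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const int N = 1 << 20;                    /* illustrative working-set size */
    size_t bytes = N * sizeof(float);

    float *h_data = (float *)malloc(bytes);   /* buffer in host (CPU) RAM */
    float *d_data = NULL;                     /* buffer in GPU RAM */

    /* ... fill h_data ... */

    cudaMalloc((void **)&d_data, bytes);                          /* allocate in GPU RAM */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);    /* copy working set in */

    /* ... launch kernels that operate on d_data (see the CUDA section later) ... */

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);    /* copy results back out */

    cudaFree(d_data);
    free(h_data);
    return 0;
}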
NVIDIA Kepler SMX (i.e. CPU core equivalent)
NVIDIA Kepler SMX Details

• 2-wide in-order
• 4-wide SMT
• 2048 threads per core (64 warps)
• 15 cores
• Each thread runs the same code (hence SIMT)
• 65,536 32-bit registers (256 KB)
• A thread can use up to 255 of these
• Partitioned among threads (not shared!)
• 192 ALUs
• 64 double-precision units
• 32 load/store units
• 32 special function units
• 64 KB L1/shared cache
• The shared portion is a software-managed cache
CPU vs GPU

Parameter               CPU                       GPU
------------------------------------------------------------------------
Clock speed             > 1 GHz                   700 MHz
RAM                     GB to TB                  12 GB (max)
Memory B/W              60 GB/s                   > 300 GB/s
Peak FP                 < 1 TFLOP/s               > 1 TFLOP/s
Concurrent threads      O(10)                     O(1000) [O(10000)]
LLC size                > 100 MB (L3) [eDRAM],    < 2 MB (L2)
                        O(10 MB) [traditional]
Cache size per thread   O(1 MB)                   O(10 bytes)
Software-managed cache  None                      48 KB/SMX
Type                    OOO superscalar           2-way in-order superscalar
Using a GPU

• You must retarget code for the GPU


• Rewrite, recompile, translate, etc.
• Working set must fit in GPU RAM
• You must copy data to/from GPU RAM
• “You”: Programmer, Compiler, Runtime, OS, etc.
• Some recent hardware can do this for you
• Data accesses should be streaming
• Or use the scratchpad (shared memory) as a user-managed cache (see the sketch after this list)
• Lots of parallelism preferred (throughput, not latency)
• SIMD-style parallelism best suited
• High arithmetic intensity (FLOPs/byte) preferred
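
A minimal sketch of the scratchpad idea mentioned above (the kernel name, tile size, and the 3-point smoothing operation are illustrative assumptions): each block stages its slice of the input in __shared__ memory, synchronizes, and then reuses the staged values instead of re-reading global GPU RAM.

#define TILE 256

__global__ void smooth(const float *in, float *out, int n) {
    __shared__ float tile[TILE];              /* per-block software-managed scratchpad */
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    /* Streaming load: each thread brings one element into the scratchpad. */
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                          /* wait until the whole tile is staged */

    if (gid < n) {
        /* Neighbouring values now come from fast on-chip memory.
           (Block and array boundaries fall back to the centre value
           to keep the sketch short.) */
        float c = tile[threadIdx.x];
        float l = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : c;
        float r = (threadIdx.x + 1 < TILE && gid + 1 < n) ? tile[threadIdx.x + 1] : c;
        out[gid] = (l + c + r) / 3.0f;
    }
}

Launched as, e.g., smooth<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n), with d_in and d_out already resident in GPU RAM.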
Showcase GPU Applications

• Image Processing
• Graphics Rendering
• Matrix Multiply
• FFT
See “Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU” by V. W. Lee et al. for more examples and a comparison of CPU and GPU.
Outline

Introduction to Accelerators

GPU Architectures

GPU Programming Models


Hierarchy of GPU Programming Models

Model                   GPU                        CPU Equivalent
------------------------------------------------------------------------
Vectorizing Compiler    PGI CUDA Fortran           gcc, icc, etc.
“Drop-in” Libraries     cuBLAS                     ATLAS
Directive-driven        OpenACC, OpenMP-to-CUDA    OpenMP
High-level languages    pyCUDA                     python
Mid-level languages     OpenCL, CUDA               pthreads + C/C++
Low-level languages     PTX, Shader                -
Bare-metal              SASS                       Assembly/Machine code
“Drop-in” Libraries

• “Drop-in” replacements for popular CPU libraries, examples from NVIDIA:
• CUBLAS/NVBLAS for BLAS (e.g. ATLAS)
• CUFFT for FFTW
• MAGMA for LAPACK and BLAS
• These libraries may still expect you to manage data transfers manually
• Libraries may support multiple accelerators (GPU + CPU + Xeon Phi)
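
As a hedged illustration of the “manage data transfers yourself” caveat, here is a sketch of calling cuBLAS for a SAXPY (the wrapper name is an illustrative assumption; d_x and d_y are assumed to already live in GPU memory):

#include <cublas_v2.h>

/* y = alpha*x + y on the GPU. The caller has already allocated d_x and
   d_y with cudaMalloc and copied the inputs over with cudaMemcpy;
   cuBLAS replaces only the compute call, not the data movement. */
void gpu_saxpy(int n, float alpha, const float *d_x, float *d_y) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    cublasDestroy(handle);
}

(NVBLAS, mentioned above, goes further for BLAS Level-3 routines: it can intercept standard BLAS calls, so existing binaries need no source changes.)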
GPU Libraries

• NVIDIA Thrust
• Like C++ STL, but executes on the GPU
• Modern GPU
• At first glance: high-performance library routines for sorting, searching, reductions, etc.
• A deeper look: specific “hard” problems tackled in a different style
• NVIDIA CUB
• Low-level primitives for use in CUDA kernels
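
A short sketch of the Thrust style (the random data and the array size are illustrative assumptions): the STL-like calls below turn into GPU kernels and device memory management.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void) {
    thrust::host_vector<int> h(1 << 20);          /* data in CPU memory */
    for (size_t i = 0; i < h.size(); i++)
        h[i] = std::rand();

    thrust::device_vector<int> d = h;             /* copy to GPU memory */
    thrust::sort(d.begin(), d.end());             /* sort runs on the GPU */
    thrust::copy(d.begin(), d.end(), h.begin());  /* copy the result back */
    return 0;
}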
Directive-Driven Programming
• OpenACC, a new standard for “offloading” parallel work to an accelerator
• Currently supported only by PGI Accelerator compiler
• gcc 5.0 support is ongoing
• OpenMPC, a research compiler, can compile OpenMP code +
extra directives to CUDA
• OpenMP 4.0 also supports offload to accelerators
• Not for GPUs yet

#include <stdio.h>

#define N 1000000 /* number of intervals; value assumed for illustration */

int main(void) {
  double pi = 0.0; long i;

#pragma acc parallel loop reduction(+:pi)
  for (i = 0; i < N; i++) {
    double t = (double)((i + 0.5) / N);
    pi += 4.0 / (1.0 + t * t);
  }

  printf("pi=%16.15f\n", pi / N);
  return 0;
}
Python-based Tools (pyCUDA)
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)

multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))

print(dest - a * b)
OpenCL

• C99-based dialect for programming heterogeneous systems


• Originally based on CUDA
• nomenclature is different
• Supported by more than GPUs
• Xeon Phi, FPGAs, CPUs, etc.
• Source code is portable (somewhat)
• Performance may not be!
• Poorly supported by NVIDIA
CUDA

• “Compute Unified Device Architecture”


• First language to allow general-purpose programming for
GPUs
• preceded by shader languages
• Promoted by NVIDIA for their GPUs
• Not supported by any other accelerator
• though commercial CUDA-to-x86/64 compilers exist
• We will focus on CUDA programs
CUDA Architecture

• From 10000 feet – CUDA is like pthreads


• CUDA language – C++ dialect
• Host code (CPU) and GPU code in same file
• Special language extensions for GPU code
• CUDA Runtime API
• Manages runtime GPU environment
• Allocation of memory, data transfers, synchronization with
GPU, etc.
• Usually invoked by host code
• CUDA Device API
• Lower-level API that CUDA Runtime API is built upon
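
To tie these pieces together, here is a minimal sketch of a complete CUDA program (the kernel name, sizes, and scaling operation are illustrative assumptions): the __global__ function is GPU code, the <<<...>>> launch is the language extension, and the cuda*() calls are the Runtime API.

#include <cstdio>
#include <cuda_runtime.h>

/* GPU code: each thread handles one array element (SIMT). */
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main(void) {
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));                   /* Runtime API: allocate GPU memory */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  /* Runtime API: copy data in */

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                  /* language extension: launch kernel */

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  /* copy out; also waits for the kernel */
    cudaFree(d);

    printf("h[10] = %f\n", h[10]);
    return 0;
}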
CUDA Limitations

• No standard library for GPU functions


• No parallel data structures
• No synchronization primitives (mutex, semaphores, queues,
etc.)
• you can roll your own
• only atomic*() functions provided
• Toolchain not as mature as CPU toolchain
• Felt intensely in performance debugging
• It’s only been a decade :)
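
As an illustration of the last two points, a histogram sketch (the kernel name and 256-bin layout are illustrative assumptions): the only synchronization it uses is an atomic*() function; anything richer, such as a mutex or a queue, would have to be rolled by hand on top of such atomics.

/* bins must point to 256 zero-initialized counters in GPU memory.
   Many threads may hit the same bin at once; atomicAdd() is the
   only synchronization primitive CUDA provides for this. */
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}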
Conclusions

• GPUs are very interesting parallel machines


• They’re not going away
• Xeon Phi might pose a formidable challenge
• They’re here and now
• Your laptop probably already contains one
• Your phone definitely has one
