CUDA
Matthew Guidry
Charles McClendon
Introduction to CUDA
• CUDA is a platform for performing massively
parallel computations on graphics accelerators
• CUDA was developed by NVIDIA
• It was first available with their G8X line of
graphics cards
• Approximately 1 million CUDA-capable GPUs
are shipped every week
• CUDA presents a unique opportunity to
develop widely-deployed parallel applications
CUDA
• Because of the Power Wall, the Latency Wall,
etc. (the free lunch is over), we must find a
way to keep our processor-intensive programs
from slowing to a crawl
• With CUDA it is possible to do things like
simulate networks of brain neurons
• CUDA brings the possibility of ubiquitous
supercomputing to the everyday computer…
CUDA
• CUDA is supported on all of NVIDIA’s G8X and
above graphics cards
Grid – A group of
one or more blocks. A
grid is created for each
CUDA kernel launch
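The grid/block hierarchy above maps directly onto the kernel launch syntax. A minimal sketch (the kernel and variable names are ours, not from the slides): launching 2 blocks of 4 threads creates one grid of 8 threads, each of which computes a unique global index.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread derives a unique global index from its block and thread IDs.
__global__ void writeIndex(int *out) {
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    out[globalIdx] = globalIdx;
}

int main() {
    const int blocks = 2, threadsPerBlock = 4;
    const int n = blocks * threadsPerBlock;
    int host[n];
    int *dev;
    cudaMalloc(&dev, n * sizeof(int));

    // One grid of 2 blocks is created for this kernel launch.
    writeIndex<<<blocks, threadsPerBlock>>>(dev);
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("%d ", host[i]);   // global indices 0 through 7
    printf("\n");
    cudaFree(dev);
    return 0;
}
```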
CUDA Memory Hierarchy
• The CUDA platform has three primary memory types
Local Memory – per-thread memory for automatic variables and
register spilling
Shared Memory – per-block, on-chip memory for low-latency data
sharing among the threads of a block
Global Memory – device-wide memory accessible by all threads
• Fully pipelined
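In addition to per-thread local memory, CUDA kernels can use per-block `__shared__` memory and device-wide global memory. A sketch showing all three spaces in one kernel (the kernel name and sizes are illustrative, not from the slides):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 256

// 'i' and 'sum' live in registers (spilling to per-thread local memory),
// 'partial' is per-block shared memory, and the pointer arguments refer
// to global memory.
__global__ void blockSum(const int *in, int *out) {
    __shared__ int partial[THREADS];                 // one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // register/local
    partial[threadIdx.x] = in[i];                    // global -> shared
    __syncthreads();                                 // wait for whole block
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int t = 0; t < THREADS; ++t) sum += partial[t];
        out[blockIdx.x] = sum;                       // shared -> global
    }
}

int main() {
    const int blocks = 2, n = blocks * THREADS;
    int *in, *out;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, blocks * sizeof(int));

    int host[n];
    for (int i = 0; i < n; ++i) host[i] = 1;
    cudaMemcpy(in, host, n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<blocks, THREADS>>>(in, out);

    int sums[blocks];
    cudaMemcpy(sums, out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d %d\n", sums[0], sums[1]);   // each block sums 256 ones
    cudaFree(in); cudaFree(out);
    return 0;
}
```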
Code Example: Revisited
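The code from this slide is not reproduced in the text. The canonical first CUDA example, element-wise vector addition, is sketched here in its place (names and sizes are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements; the guard handles the case
// where the grid is larger than the array.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float host_a[n], host_b[n], host_c[n];
    for (int i = 0; i < n; ++i) { host_a[i] = i; host_b[i] = 2.0f * i; }

    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMemcpy(a, host_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, host_b, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(a, b, c, n);

    cudaMemcpy(host_c, c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", host_c[10]);   // a[10] + b[10]
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```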
Myths About CUDA
• GPUs are the only processors in a CUDA application
– The CUDA platform is a co-processor, using the CPU and GPU
• GPUs have very wide (1000s of lanes) SIMD machines
– No, a CUDA warp is only 32 threads
• Branching is not possible on GPUs
– Incorrect; branches are supported, though divergent paths
within a warp are serialized
• GPUs are power-inefficient
– Nope, performance per watt is quite good
• CUDA is only for C or C++ programmers
– Not true, there are third party wrappers for Java, Python, and
more
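The branching myth above can be made concrete. In the hypothetical kernel below, threads in the same 32-thread warp take different paths; the hardware executes both sides in turn with inactive threads masked off, so the branch is legal but divergence costs performance:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Even- and odd-indexed threads within one warp diverge here; the
// hardware serializes the two paths rather than forbidding the branch.
__global__ void branchy(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = i * i;   // path taken by even-numbered threads
    else
        out[i] = -i;      // path taken by odd-numbered threads
}

int main() {
    const int n = 8;
    int *dev, host[n];
    cudaMalloc(&dev, n * sizeof(int));
    branchy<<<1, n>>>(dev);
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("%d ", host[i]);   // alternating squared and negated values
    printf("\n");
    cudaFree(dev);
    return 0;
}
```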
Different Types of CUDA Applications
Future Developments of CUDA
• The next generation of CUDA hardware,
codenamed “Fermi,” will be the standard on
the GeForce 300 series
• Fermi will have full support for IEEE 754
double precision
• Fermi will natively support more programming
languages
• Also, there is a new project, OpenCL, that
seeks to provide an abstraction layer over
CUDA and similar platforms (e.g., AMD’s Stream)
Things to Ponder…