Cuda Notes From Udacity Lecture

The document discusses efficient GPU programming techniques. It introduces CUDA kernel programming and describes how to cube values in parallel by assigning each thread to cube one element. It then discusses GPU programming patterns like map, reduce, gather, scatter and stencil that assign work to threads. It emphasizes optimizing memory access by maximizing arithmetic intensity, using faster shared memory over global memory, and coalescing memory reads. It also introduces the use of atomic operations and shared memory to coordinate thread work and avoid races when writing to global memory.


/* Lesson 1 --Code from Quiz */

#include <stdio.h>
__global__ void cube(float * d_out, float * d_in) {
    // each thread cubes exactly one element of the input array
    int thid = threadIdx.x;
    float num = d_in[thid];
    d_out[thid] = num * num * num;
}

int main(int argc, char ** argv) {
    const int ARRAY_SIZE = 96;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    // generate the input array on the host
    float h_in[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) {
        h_in[i] = float(i);
    }
    float h_out[ARRAY_SIZE];

    // declare GPU memory pointers
    float * d_in;
    float * d_out;

    // allocate GPU memory
    cudaMalloc((void**) &d_in, ARRAY_BYTES);
    cudaMalloc((void**) &d_out, ARRAY_BYTES);

    // transfer the array to the GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // launch the kernel: one block of ARRAY_SIZE threads, one thread per element
    cube<<<1, ARRAY_SIZE>>>(d_out, d_in);

    // copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    // print out the resulting array
    for (int i = 0; i < ARRAY_SIZE; i++) {
        printf("%f", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }

    // free GPU memory
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
/* LESSON 2 */
many threads solving a problem by working together
parallel communication patterns (sketched as kernels below):
map: one to one
transpose: one to one
gather: many to one
scatter: one to many
stencil: several to one

reduce: all to one

scan/sort: all to all
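A minimal sketch of three of these patterns as CUDA kernels; the kernel names, the averaging computation, and the dest permutation array are illustrative assumptions, not from the lecture:

// map: thread i reads element i and writes element i (one to one);
// like the cube kernel above, this assumes the grid exactly covers the array
__global__ void map_square(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * in[i];
}

// gather: each output element is computed from several input locations (many to one)
__global__ void gather_avg3(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

// scatter: each thread computes where its result should go (one to many overall)
__global__ void scatter_perm(float *out, const float *in, const int *dest, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[dest[i]] = in[i];   // thread i writes wherever dest[i] points
}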
shared memory: shared within a block
global memory: shared across all threads
local memory: private to each thread
__syncthreads(); is crucial when you separate read/write phases
must ensure all values are written before you can start reading them
maximize arithmetic intensity (math per unit of memory access)
-minimize the time spent on memory accesses per thread
-speed: local memory > shared memory >> global memory
__shared__ float sh_array[128];
sh_array[index] = array[index]; copies from global to shared memory (a full kernel using this pattern is sketched below)
__syncthreads(); ensures that copy is complete before anyone reads from shared memory
make sure to coalesce memory accesses --adjacent threads should read adjacent memory locations
shared memory has the lifetime of the thread block
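A hedged sketch of that staging pattern: copy a tile from global to shared memory, sync, then compute from the fast copy. The kernel name and the 3-point averaging are assumptions for illustration; the block size is assumed to be 128 to match sh_array:

__global__ void avg_with_shared(float *out, const float *in) {
    int gi = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int li = threadIdx.x;                            // index within the block
    __shared__ float sh_array[128];                  // lifetime of the thread block

    sh_array[li] = in[gi];      // coalesced copy from global to shared
    __syncthreads();            // all writes to shared memory finish before any reads

    // interior threads read their neighbours from fast shared memory
    if (li > 0 && li < blockDim.x - 1)
        out[gi] = (sh_array[li - 1] + sh_array[li] + sh_array[li + 1]) / 3.0f;
}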
when many threads write to the same array elements you get race conditions
CUDA has a class of functions called atomics to handle this
e.g.)
atomicAdd(&g[i], 1); // atomically adds 1 to g[i]
using atomicCAS() as a workaround you can build any atomic operation (see the sketch after the kernels below)
// ARRAY_SIZE is a constant defined elsewhere in the lesson code
__global__ void increment_naive(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread increments consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;
    g[i] = g[i] + 1;      // unsynchronized read-modify-write: a race
}

__global__ void increment_atomic(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread increments consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;
    atomicAdd(&g[i], 1);  // atomic read-modify-write: no race
}
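The atomicCAS() note above refers to the standard compare-and-swap retry loop. A hedged sketch of that idea, building an operation CUDA does not provide as a built-in (atomic float multiply); the atomicMulFloat name is an illustrative assumption:

__device__ float atomicMulFloat(float *addr, float val) {
    int *addr_as_int = (int *)addr;          // reinterpret the float's bits as an int for atomicCAS
    int old = *addr_as_int, assumed;
    do {
        assumed = old;
        // retry until no other thread changed *addr between our read and our CAS
        old = atomicCAS(addr_as_int, assumed,
                        __float_as_int(val * __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);              // previous value, like the built-in atomics
}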
//Summary
gather, scatter, stencil, transpose
SMs, threads, blocks, ordering
local, global, shared memory, atomics

//Efficient GPU programming


high arithmetic intensity --move data to faster memory if you need to
speed: local > shared > global
use coalesced access if you need global memory (sketched below)
avoid diverging threads (badly designed if statements and loops)
divergent loops force syncing afterwards: the warp effectively runs the loop as many times as its slowest thread
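A minimal sketch of coalesced vs. strided global access and of branch divergence; the kernel names and the stride parameter are illustrative assumptions:

// coalesced: consecutive threads touch consecutive addresses (few transactions per warp)
__global__ void copy_coalesced(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// strided: consecutive threads touch addresses far apart (many transactions per warp)
__global__ void copy_strided(float *out, const float *in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}

// divergence: threads in the same warp take different branches and the branches serialize
__global__ void divergent_branch(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)              // half the warp idles while the other half runs
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}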
