GPU Architectures: A CPU Perspective

Derek Hower, AMD Research
5/21/2013
With updates by David Wood
Goals
- Data Parallelism: What is it, and how to exploit it?
  - Workload characteristics
- Execution Models / GPU Architectures
  - MIMD (SPMD), SIMD, SIMT
- GPU Programming Models
  - Terminology translations: CPU / AMD GPU / Nvidia GPU
  - Intro to OpenCL
- Modern GPU Microarchitectures
  - i.e., programmable GPU pipelines, not their fixed-function predecessors
- Advanced Topics (time permitting)
  - The Limits of GPUs: What they can and cannot do
  - The Future of GPUs: Where do we go from here?
Data Parallel Execution on GPUs
Data Parallelism, Programming Models, SIMT
Graphics Workloads
- Identical, Independent, Streaming computation on pixels
[Diagram: a stream of pixels flowing through the GPU]
Architecture Spelling Bee
- Spell "Independent": P-A-R-A-L-L-E-L
Generalize: Data Parallel Workloads
- Identical, Independent computation on multiple data inputs
[Diagram: four inputs (0,7) (1,7) (2,7) (3,7), each mapped through the same function =f() to outputs (7,0) (6,0) (5,0) (4,0)]
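A minimal C sketch of this pattern (the per-element function is illustrative): the same computation is applied to every input, and no iteration depends on any other, so iterations could run in any order -- or all at once.

    #include <stddef.h>

    /* Identical, independent work: apply the same f() to every element.
     * No iteration reads another iteration's data, so all of them
     * could execute in parallel. */
    void transform(const float *in, float *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = 2.0f * in[i] + 1.0f;  /* stand-in for f() */
    }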
Naïve Approach
- Split independent work over multiple processors
[Diagram: inputs (0,7)...(3,7) assigned to CPU0...CPU3, each computing =f() to produce (7,0)...(4,0)]
Data Parallelism: A MIMD Approach
- Multiple Instruction Multiple Data
- Split independent work over multiple processors
- When work is identical (same program): Single Program Multiple Data (SPMD), a subcategory of MIMD
[Diagram: CPU0-CPU3 each run a copy of the program through a full Fetch/Decode/Execute/Memory/Writeback pipeline on inputs (0,7)...(3,7)]
Data Parallelism: An SPMD Approach
- Single Program Multiple Data
- Split identical, independent work over multiple processors
[Diagram: CPU0-CPU3 all running the same program, each through its own Fetch/Decode/Execute/Memory/Writeback pipeline]
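A hedged SPMD sketch in C with POSIX threads (thread count, array size, and the work function are illustrative): every thread runs the same program body, and only the data slice differs.

    #include <pthread.h>
    #include <stdio.h>

    #define N        1024
    #define NTHREADS 4

    static float in[N], out[N];

    /* Single Program: every thread executes this same function. */
    static void *worker(void *arg)
    {
        long id = (long)arg;               /* Multiple Data: slice by id */
        for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
            out[i] = 2.0f * in[i] + 1.0f;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long id = 0; id < NTHREADS; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < NTHREADS; id++)
            pthread_join(t[id], NULL);
        printf("out[0] = %f\n", out[0]);
        return 0;
    }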
Data Parallelism: A SIMD Approach
- Single Instruction Multiple Data
- Split identical, independent work over multiple execution units (lanes)
- More efficient: eliminates redundant fetch/decode
[Diagram: one CPU with a single Fetch/Decode stage feeding four Execute/Memory/Writeback lanes, processing (0,7)...(3,7) into (7,0)...(4,0)]
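A hedged C sketch of the SIMD idea using x86 SSE intrinsics (the comparison table later cites x86 SSE/AVX as the example architecture): one fetched and decoded instruction operates on four packed floats, eliminating three redundant fetch/decodes. Assumes n is a multiple of 4 and 16-byte-aligned pointers.

    #include <xmmintrin.h>  /* SSE intrinsics */

    void transform_simd(const float *in, float *out, int n)
    {
        __m128 two = _mm_set1_ps(2.0f);
        __m128 one = _mm_set1_ps(1.0f);
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(&in[i]);           /* one load, four lanes  */
            v = _mm_add_ps(_mm_mul_ps(v, two), one);  /* one mul, one add      */
            _mm_store_ps(&out[i], v);                 /* one store, four lanes */
        }
    }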
SIMD: A Closer Look
- One thread + data-parallel ops
- Single PC, single register file
[Diagram: the same four-lane pipeline, now showing one register file shared by all lanes]
Data Parallelism: A SIMT Approach
- Single Instruction Multiple Thread
- Split identical, independent work over multiple lockstep threads
- Multiple threads + scalar ops
- One PC, multiple register files
[Diagram: one Fetch/Decode stage driving wavefront WF0: four lockstep threads, each with its own Execute/Memory/Writeback path]
Terminology Headache #1
- It's common to interchange SIMD and SIMT
Data Parallel Execution Models
- MIMD/SPMD: multiple independent threads
- SIMD/Vector: one thread with a wide execution datapath
- SIMT: multiple lockstep threads
Execution Model Comparison

                     | MIMD/SPMD                        | SIMD/Vector                        | SIMT
Example architecture | Multicore CPUs                   | x86 SSE/AVX                        | GPUs
Pros                 | More general: supports TLP       | Can mix sequential & parallel code | Easier to program; Gather/Scatter operations
Cons                 | Inefficient for data parallelism | Gather/Scatter can be awkward      | Divergence kills performance
GPUs and Memory
- Recall: GPUs perform streaming computation -> streaming memory access
- GPU DRAM latency: 100s of GPU cycles
- How do we keep the GPU busy (hide memory latency)?
Hiding Memory Latency
Options from the CPU world:
- Caches: need spatial/temporal locality
- OoO/Dynamic Scheduling: need ILP
- Multicore/Multithreading/SMT: need independent threads
Multicore Multithreaded SIMT
- Many SIMT threads grouped together into a GPU Core
  - SIMT threads in a group ~ SMT threads in a CPU core
  - Unlike a CPU, groups are exposed to programmers
- Multiple GPU Cores
[Diagram: a GPU composed of multiple GPU Cores]
This is a GPU Architecture (Whew!)
GPU Component Names

AMD/OpenCL         | Derek's CPU Analogy
Processing Element | Lane
SIMD Unit          | Pipeline
Compute Unit       | GPU Core
GPU Device         | Device
GPU Programming Models
OpenCL
GPU Programming Models
- CUDA: Compute Unified Device Architecture
  - Developed by Nvidia -- proprietary
  - First serious GPGPU language/environment
- OpenCL: Open Computing Language
  - From the makers of OpenGL
  - Wide industry support: AMD, Apple, Qualcomm, Nvidia (begrudgingly), etc.
- C++ AMP: C++ Accelerated Massive Parallelism
  - Microsoft
  - Much higher abstraction than CUDA/OpenCL
- OpenACC: Open Accelerator
  - Like OpenMP for GPUs (semi-auto-parallelize serial code)
  - Much higher abstraction than CUDA/OpenCL
OpenCL
- Early CPU languages were light abstractions of physical hardware
  - E.g., C
- Early GPU languages are light abstractions of physical hardware
  - OpenCL + CUDA
[Diagram: GPU architecture beside the OpenCL model: GPU <-> NDRange, GPU Core <-> Workgroup; workgroups contain wavefronts, wavefronts contain work-items]
NDRange
- N-Dimensional (N = 1, 2, or 3) index space
- Partitioned into workgroups, wavefronts, and work-items
[Diagram: an NDRange partitioned into workgroups]
Kernel
- Run an NDRange on a kernel (i.e., a function)
- Same kernel executes for each work-item
- Smells like MIMD/SPMD... but beware, it's not!
[Diagram: each work-item in a workgroup runs the kernel on its own input: (0,7)->(7,0) ... (3,7)->(4,0)]
OpenCL Code

    __kernel void flip_and_recolor(__global const float3 *in_image,
                                   __global float3 *out_image,
                                   int img_dim_x, int img_dim_y)
    {
        int x = get_global_id(0); // get work-item id in dimension 0
        int y = get_global_id(1); // get work-item id in dimension 1
        // 2D images passed as flat buffers: OpenCL kernels cannot take
        // pointer-to-pointer arguments from the host
        out_image[(img_dim_x - 1 - x) * img_dim_y + (img_dim_y - 1 - y)] =
            recolor(in_image[x * img_dim_y + y]); // recolor() defined elsewhere
    }
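For context, a hedged host-side sketch of launching such a kernel (error checks omitted; the cl_kernel and cl_command_queue are assumed to have been created, and the kernel arguments set, beforehand): the global work size defines the NDRange, and each index in it becomes one work-item.

    #include <CL/cl.h>

    void launch(cl_command_queue queue, cl_kernel kernel,
                size_t img_dim_x, size_t img_dim_y)
    {
        size_t global[2] = { img_dim_x, img_dim_y };  /* NDRange: one work-item
                                                         per pixel */
        size_t local[2]  = { 8, 8 };                  /* workgroup: 64 work-items
                                                         (one GCN wavefront) */
        clEnqueueNDRangeKernel(queue, kernel,
                               2,              /* 2-dimensional index space */
                               NULL,           /* no global offset */
                               global, local,
                               0, NULL, NULL); /* no event dependencies */
        clFinish(queue);                       /* wait for completion */
    }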
GPU Microarchitecture
AMD Graphics Core Next
GPU Hardware Overview
[Diagram: GPU with GDDR5 DRAM and a shared L2 Cache; each GPU Core contains four SIMT Units plus an L1 Cache and Local Memory]
Compute Unit: A GPU Core
- Compute Unit (CU): runs workgroups
  - Contains 4 SIMT Units
  - Picks one SIMT Unit per cycle for scheduling
- SIMT Unit: runs wavefronts
  - Each SIMT Unit has a 10-wavefront instruction buffer
  - Takes 4 cycles to execute one wavefront
[Diagram: a CU with four SIMT Units, an L1 Cache, and Local Memory]
10 wavefronts x 4 SIMT Units = 40 active wavefronts / CU
64 work-items / wavefront x 40 active wavefronts = 2560 active work-items / CU
Compute Unit Timing Diagram
- On average: fetch & commit one wavefront / cycle
[Timing diagram: over cycles 1-12, SIMT0-SIMT3 each execute wavefronts in 4-cycle slices (SIMT0: WF1_0..WF1_3, WF5, WF9; SIMT1: WF2, WF6, WF10; SIMT2: WF3, WF7, WF11; SIMT3: WF4, WF8, WF12), staggered by one cycle so one wavefront completes per cycle]
SIMT Unit: A GPU Pipeline
- Like a wide CPU pipeline -- except one fetch for entire width
- 16-wide physical ALU
  - Executes a 64-wide wavefront over 4 cycles. Why?? (64 work-items / 16 lanes = 4 cycles)
- 64KB register state / SIMT Unit
  - Compare to x86 (Bulldozer): ~1KB of physical register file state (~1/64 size)
- Address Coalescing Unit: a key to good memory performance
[Diagram: 16 register-file lanes feeding the ALU, plus the Address Coalescing Unit]
Address Coalescing
- Wavefront: issues 64 memory requests
- Common case: work-items in same wavefront touch same cache block
- Coalescing: merge many work-items' requests into single cache-block request
- Important for performance: reduces bandwidth to DRAM
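A hedged OpenCL C sketch of the two access patterns (buffer names and the stride parameter are illustrative): consecutive work-items touching consecutive addresses coalesce into a few cache-block requests, while large-stride accesses from the same wavefront do not.

    __kernel void access_patterns(__global const float *in,
                                  __global float *out,
                                  int stride)
    {
        int i = get_global_id(0);

        /* Coalesced: work-items 0..63 of a wavefront read in[0..63],
         * which the coalescing unit merges into a few block requests. */
        float a = in[i];

        /* Uncoalesced: with a large stride, the same 64 work-items touch
         * 64 different cache blocks -- 64 separate requests toward DRAM. */
        float b = in[i * stride];

        out[i] = a + b;
    }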
GPU Memory
GPUs have caches.
Not Your CPU's Cache
By the numbers: Bulldozer FX-8170 vs. GCN Radeon HD 7970

                                          | CPU (Bulldozer) | GPU (GCN)
L1 data cache capacity                    | 16KB            | 16KB
Active threads (work-items) sharing L1 D$ | 1               | 2560
L1 dcache capacity / thread               | 16KB            | 6.4 bytes
Last level cache (LLC) capacity           | 8MB             | 768KB
Active threads (work-items) sharing LLC   | 8               | 81,920
LLC capacity / thread                     | 1MB             | 9.6 bytes
GPU Caches
- Maximize throughput, not hide latency
  - Not there for either spatial or temporal locality
- L1 Cache: coalesce requests to same cache block by different work-items
  - i.e., streaming "thread locality"?
  - Keep block around just long enough for each work-item to hit once
  - Ultimate goal: reduce bandwidth to DRAM
- L2 Cache: DRAM staging buffer + some instruction reuse
  - Ultimate goal: tolerate spikes in DRAM bandwidth
- If there is any spatial/temporal locality: use local memory (scratchpad)
Scratchpad Memory
- GPUs have scratchpads (Local Memory)
  - Allocated to a workgroup, i.e., shared by wavefronts in workgroup
- Separate address space
- Managed by software:
  - Rename address
  - Manage capacity -- manual fill/eviction
[Diagram: GPU Core with four SIMT Units, an L1 Cache, and Local Memory]
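A hedged OpenCL C sketch of the software-managed fill/use pattern (a 1D three-point average; names and sizes are illustrative, and boundary handling is simplified): the workgroup fills local memory manually, synchronizes, then reuses the staged data.

    #define WG_SIZE 64  /* one workgroup = one wavefront, for illustration */

    __kernel void smooth(__global const float *in, __global float *out)
    {
        __local float tile[WG_SIZE + 2];  /* scratchpad shared by workgroup */
        int gid = get_global_id(0);
        int lid = get_local_id(0);

        /* Manual fill: each work-item stages one element; edge work-items
         * also stage the halo (boundary clamping simplified here). */
        tile[lid + 1] = in[gid];
        if (lid == 0)           tile[0]           = in[gid];
        if (lid == WG_SIZE - 1) tile[WG_SIZE + 1] = in[gid];

        /* Wavefronts in a workgroup are not lockstep with each other:
         * a local barrier makes the fill visible to all of them. */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Reuse: three reads from fast local memory instead of DRAM. */
        out[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
    }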
Example System: Radeon HD 7970
High-end part
- 32 Compute Units: 81,920 active work-items
- 32 CUs x 4 SIMT Units x 16 ALUs = 2048 max FP ops/cycle
- 264 GB/s max memory bandwidth
- 925 MHz engine clock
- 3.79 TFLOPS single precision (accounting trickery: FMA)
  - 2048 FP ops/cycle x 0.925 GHz = 1.89 TFLOPS; counting each fused multiply-add as two ops doubles it to 3.79 TFLOPS
- 210W max power (chip)
- >350W max power (card)
- 100W idle power (card)
46
Radeon
HD
7990
-
Cooking
Two
7970s
on
one
card:
375W
(AMD
Ocial)
450W
(OEM)
GPU
ARCHITECTURES:
A
CPU
PERSPECTIVE
47
A
Rose
by
Any
Other
Name
GPU
ARCHITECTURES:
A
CPU
PERSPECTIVE
48
Terminology Headaches #2-5

Nvidia/CUDA              | AMD/OpenCL         | Derek's CPU Analogy
CUDA Processor           | Processing Element | Lane
CUDA Core                | SIMD Unit          | Pipeline
Streaming Multiprocessor | Compute Unit       | GPU Core
GPU Device               | GPU Device         | Device
Terminology Headaches #6-9

CUDA/Nvidia | OpenCL/AMD | Henn&Patt
Thread      | Work-item  | Sequence of SIMD Lane Operations
Warp        | Wavefront  | Thread of SIMD Instructions
Block       | Workgroup  | Body of vectorized loop
Grid        | NDRange    | Vectorized loop
Terminology Headache #10
- GPUs have scratchpads (Local Memory)
  - Allocated to a workgroup, i.e., shared by wavefronts in workgroup
  - Separate address space, managed by software: rename address, manage capacity (manual fill/eviction)
Nvidia calls Local Memory "Shared Memory".
AMD sometimes calls it "Group Memory".
Recap
- Data Parallelism: identical, independent work over multiple data inputs
  - GPU version: add streaming access pattern
- Data Parallel Execution Models: MIMD, SIMD, SIMT
- GPU Execution Model: Multicore Multithreaded SIMT
- OpenCL Programming Model: NDRange over workgroup/wavefront
- Modern GPU Microarchitecture: AMD Graphics Core Next (GCN)
  - Compute Unit (GPU Core): 4 SIMT Units
  - SIMT Unit (GPU Pipeline): 16-wide ALU pipe (16x4 execution)
  - Memory: designed to stream
- GPUs: great for data parallelism. Bad for everything else.
Advanced Topics
GPU Limitations, Future of GPGPU
Choose Your Own Adventure!
- SIMT Control Flow & Branch Divergence
- Memory Divergence
- When GPUs talk
  - Wavefront communication
  - GPU coherence
  - GPU consistency
- Future of GPUs: What's next?
SIMT Control Flow
- Consider a SIMT conditional branch:
  - One PC
  - Multiple data (i.e., multiple conditions)

    if (x <= 0)
        y = 0;
    else
        y = x;
SIMT Control Flow
- Work-items in a wavefront run in lockstep
  - Don't all have to commit
- Branching through predication
  - Active lane: commit result
  - Inactive lane: throw away result
- This is branch divergence. The execution mask tracks which lanes are active:

    if (x <= 0)   // All lanes active at start:       1111
        y = 0;    // Branch sets execution mask:      1000
    else          // Else inverts execution mask:     0111
        y = x;    // Converge resets execution mask:  1111
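A hedged C sketch of what the hardware effectively does (a 4-lane wavefront modeled with plain arrays; names are illustrative): both paths are issued over all lanes, and the mask decides which lanes commit.

    #define LANES 4

    /* Emulate predicated execution of: if (x <= 0) y = 0; else y = x; */
    void simt_branch(const int x[LANES], int y[LANES])
    {
        int mask[LANES];

        for (int l = 0; l < LANES; l++)  /* evaluate condition per lane  */
            mask[l] = (x[l] <= 0);

        for (int l = 0; l < LANES; l++)  /* "then" path: active lanes    */
            if (mask[l]) y[l] = 0;       /* commit, inactive lanes skip  */

        for (int l = 0; l < LANES; l++)  /* "else" path: inverted mask   */
            if (!mask[l]) y[l] = x[l];

        /* Converge: mask resets to all-active. Note that both paths
         * consumed execution cycles regardless of the data. */
    }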
Branch Divergence
- When control flow diverges, all lanes take all paths
- Divergence Kills Performance
Beware!
Divergence isn't just a performance problem:

    __global int lock = 0;

    void mutex_lock()
    {
        // acquire lock -- test_and_set() is pseudocode that
        // returns true once the lock has been acquired
        while (test_and_set(&lock, 1) == false) {
            // spin
        }
        return;
    }

Deadlock: work-items can't enter the mutex together! The one lane that acquires the lock cannot proceed until its lockstep wavefront-mates converge, and they spin forever waiting for the lock.
Memory Bandwidth
[Diagram: four SIMT lanes (Lane 0-3) connected to four DRAM banks (Bank 0-3)]
- Parallel access: each lane touches a different bank -- all requests proceed in parallel
- Sequential access: lanes collide on the same banks -- requests are serialized
- The serialized case is memory divergence
Memory Divergence
- One work-item stalls -> entire wavefront must stall
  - Cause: bank conflicts, cache misses
- Data layout & partitioning is important
- Divergence Kills Performance
Communication and Synchronization
- Work-items can communicate with:
  - Work-items in same wavefront
    - No special sync needed -- they are lockstep!
  - Work-items in different wavefront, same workgroup (local)
    - Local barrier
  - Work-items in different wavefront, different workgroup (global)
    - OpenCL 1.x: Nope
    - OpenCL 2.x: Yes, but...
    - CUDA 4.x: Yes, but complicated
GPU Consistency Models
- Very weak guarantee:
  - Program order respected within single work-item
  - All other bets are off
- Safety net: Fence -- make sure all previous accesses are visible before proceeding
  - Built-in barriers are also fences
- A wrench: GPU fences are scoped -- only apply to subset of work-items in system
  - E.g., local barrier
- Take-away: area of active research
  - See Hower, et al. "Heterogeneous-race-free Memory Models", ASPLOS 2014
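A hedged OpenCL C sketch of scoped fencing (the flag-publishing protocol is illustrative; whether the __global write becomes visible to other workgroups is exactly the kind of guarantee that varies by GPU and OpenCL version): each fence orders this work-item's accesses only within its scope.

    __kernel void publish(__global int *data, __global volatile int *flag,
                          __local int *scratch)
    {
        int lid = get_local_id(0);

        scratch[lid] = lid;            /* write to local memory            */
        barrier(CLK_LOCAL_MEM_FENCE);  /* built-in barrier is also a fence,
                                          scoped to this workgroup only    */
        data[get_global_id(0)] = scratch[lid];

        mem_fence(CLK_GLOBAL_MEM_FENCE);  /* order the data write before the
                                             flag write, for observers that
                                             this fence's scope reaches     */
        if (lid == 0)
            *flag = 1;
    }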
GPU Coherence?
- Notice: the GPU consistency model does not require coherence
  - i.e., Single Writer, Multiple Reader
  - Marketing claims they are coherent...
- GPU "coherence":
  - Nvidia: disable private caches
  - AMD: flush/invalidate entire cache at fences
GPU Architecture Research
- Blending with CPU architecture:
  - Dynamic scheduling / dynamic wavefront re-org
  - Work-items have more locality than we think
- Tighter integration with CPU on SOC:
  - Fast kernel launch
  - Exploit fine-grained parallel region (remember Amdahl's law)
  - Common shared memory
- Reliability:
  - Historically: Who notices a bad pixel?
  - Future: GPU compute demands correctness
- Power:
  - Mobile, mobile, mobile!!!
Computer Economics 101
- GPU Compute is cool + gaining steam, but...
  - is a $0 billion industry (to quote Mark Hill)
- GPU design priorities:
  1. Graphics
  2. Graphics
  ...
  N-1. Graphics
  N. GPU Compute
- Moral of the story: GPU won't become a CPU (nor should it)