2_DataflowAnalysis
Many slides from the class of Professors Joel Emer (NVIDIA, MIT) and Vivienne Sze (MIT):
• 6.S082/6.888 Hardware Architectures for Deep Learning (Fall, 2017)
Also includes slides from: Angshuman Parashar, Senior Research Scientist (NVIDIA)
ACCELERATORS ARE GREAT.... BUT!

[Figure: a custom datapath connected to off-chip memory.]
MOTIVATION: DATA MOVEMENT

Energy costs (relative to an 8-bit integer multiply):
  8-bit integer multiply                      1x
  Fetch two 8-bit operands from large SRAM    ~10x
  Fetch two 8-bit operands from DRAM          ~100x

Why it's important: fetching the operands can cost 10-100x the energy of the arithmetic itself, so data movement, not compute, dominates energy.
MAPPING REUSE TO HARDWARE

[Figure: a 7-dimensional network layer (weights: R x S x C x K; inputs: X x Y x C x N; outputs: X' x Y' x K x N) is mapped onto a 2-D hardware array of PEs.]

Algorithmic reuse maps to hardware reuse.
PEDAGOGICAL EXAMPLE: 1-D CONVOLUTION

Weights (size S) * Inputs (size X) = Outputs* (size X' = X - ceil(S/2)†)

int i[X];  # Input activations
int w[S];  # Filter weights
int o[X']; # Output activations

* The terms "Output" and "Partial Sum" are used interchangeably.
† Assuming 'valid'-style convolution.

Note: the "stationary" name is meant to give intuition, not to be a complete specification of all the behavior of a dataflow.

[Plot: input bandwidth (0 to 2 words) over time.]
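For concreteness, a minimal weight-stationary sketch of this example (an illustrative reconstruction, not necessarily the exact code on the slide): holding each weight resident while streaming the inputs it touches keeps weight bandwidth near zero.

// Weight stationary (sketch): w[s] stays resident in the datapath
// for an entire pass over the outputs.
for (s = 0; s < S; s++) {
  for (x = 0; x < X'; x++) {
    o[x] += i[x+s] * w[s];   // partial sums accumulate across s passes
  }
}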
MORE DATAFLOWS

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

int i[X];  # Input activations
int w[S];  # Filter weights
int o[X']; # Output activations

for (x = 0; x < X'; x++) {
  for (s = 0; s < S; s++) {
    o[x] += i[x+s] * w[s];
  }
}

How can we implement input stationary when there is no standalone input index in this loop nest?
INPUT STATIONARY

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

[Figure: a single multiplier with local buffers for Weights, Inputs, and Partial Sums, all backed by a backing store. The single multiplier sequentially computes all the partial sums that use the currently resident input.]
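One answer to the question above, as a minimal sketch (my reconstruction under the 'valid'-convolution assumption): iterate over the input index directly and scatter each resident input's contributions to every output it feeds.

// Input stationary (sketch): each input i[x] is fetched once and held
// while all (weight, partial-sum) pairs that use it are processed.
for (x = 0; x < X; x++) {
  for (s = 0; s < S; s++) {
    xo = x - s;                  // output fed by this (input, weight) pair
    if (xo >= 0 && xo < X') {
      o[xo] += i[x] * w[s];
    }
  }
}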
1-D CONVOLUTION – SUMMARY

Weights (size S) * Inputs (size X) = Outputs (size X')
MULTI-LAYER BUFFERING
L1
L0 L0
Weights Weights
L1
L0 L0 Data
Inputs Inputs path
L0
L1 L0
Outputs Outputs
18
1-D CONVOLUTION – BUFFERED

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

Energy of a buffer access is a function of the size of the buffer, and each buffer level's energy is proportional to the number of accesses at that level.

For level 0, the accesses are all the operands delivered to the datapath. For a level L > 0 there are three components:
- Data arriving from level L+1
- Data that needs to be transferred to level L-1
- Data that is returned from level L-1
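A hedged sketch of this cost model (the level count and per-access energies are illustrative assumptions, loosely following the multiply/SRAM/DRAM ratios above, not values from the slides):

#include <stdio.h>

#define LEVELS 3

// Illustrative per-access energies, normalized to one multiply = 1 unit.
double energy_per_access[LEVELS] = { 1.0, 10.0, 100.0 };  // L0, L1, L2 (DRAM)

// Total energy = sum over levels of
//   (accesses at that level) * (energy per access at that level).
double hierarchy_energy(const long accesses[LEVELS]) {
  double total = 0.0;
  for (int l = 0; l < LEVELS; l++) {
    total += accesses[l] * energy_per_access[l];
  }
  return total;
}

int main(void) {
  long accesses[LEVELS] = { 1000000, 10000, 100 };  // example access counts
  printf("total energy: %.1f units\n", hierarchy_energy(accesses));
  return 0;
}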
MAPPING - WEIGHT ACCESS COSTS

// Level 1
for (x1 = 0; x1 < X'1; x1++) {
  for (s1 = 0; s1 < S1; s1++) {
    // Level 0
    for (x0 = 0; x0 < X'0; x0++) {
      for (s0 = 0; s0 < S0; s0++) {
        o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
      }
    }
  }
}

[Figure: weight accesses over time. Each s1 step (s1=0, s1++, ...) reads the next chunk of S0 weights; when the next x1 iteration begins, the pattern restarts at s1=0.]
MAPPING - WEIGHT ACCESS COSTS

Level 0 reads:
- Per level 1 iteration -> X'0*S0 weight reads
- Times X'1*S1 level 1 iterations
- Total reads = (X'0*S0)*(X'1*S1) = (X'0*X'1)*(S0*S1) = X'*S reads

Level 1 to 0 transfers:
- Per level 1 iteration -> S0 weights transferred
- Times the same number of level 1 iterations, X'1*S1
- Total transfers = S0*(X'1*S1) = X'1*(S0*S1) = X'1*S transfers
[Figure: input accesses over time. The first s1 step reads a window of X'0+S0 inputs; each later s1 step slides the window by S0, so consecutive windows overlap. The overlap between the chunks needed by neighboring iterations is the input halo.]
MAPPING - INPUT ACCESS COSTS

Level 0 reads:
- Per level 1 iteration -> X'0+S0 input reads
- Times X'1*S1 level 1 iterations
- Total reads = X'1*S1*(X'0+S0) = ((X'1*X'0)*S1)+(X'1*(S1*S0)) = X'*S1 + X'1*S reads

Level 1 to 0 transfers:
- For s1 = 0, X'0+S0 inputs transferred
- For each of the following S1-1 iterations, another S0 inputs transferred
- So the total per x1 iteration is X'0+S0*S1 = X'0+S inputs
- Times the number of x1 iterations, X'1
- Total transfers = X'1*(X'0+S) = (X'1*X'0)+X'1*S = X'+X'1*S transfers
[Figure: output accesses over time. Every s1 step accumulates into the same X'0 outputs; a fresh chunk of X'0 outputs begins with each x1 iteration.]
MAPPING - OUTPUT ACCESS COSTS

Level 0 writes:
- Because level 0 is 'output stationary', only X'0 writes per level 1 iteration
- Times X'1*S1 level 1 iterations
- Total writes = X'0*(X'1*S1) = (X'0*X'1)*S1 = X'*S1 writes

Level 0 to 1 transfers:
- After every S1 iterations, the completed partial sums for X'0 outputs are transferred
- There are X'1 chunks of S1 iterations
- Total transfers = X'1*X'0 = X' transfers
MAPPING DATA COST SUMMARY

// Level 1
for (x1 = 0; x1 < X'1; x1++) {
  for (s1 = 0; s1 < S1; s1++) {
    // Level 0
    for (x0 = 0; x0 < X'0; x0++) {
      for (s0 = 0; s0 < S0; s0++) {
        o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
      }
    }
  }
}

Operand         Level 0 accesses   Level 1<->0 transfers
Weight reads    X' * S             X'1 * S
Input reads     X'*S1 + X'1*S      X' + X'1*S
Output writes   X' * S1            X'
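As a sanity check on the algebra (a small program I am adding; the tile sizes are arbitrary assumptions), the per-iteration tallies above can be accumulated numerically and compared against the closed-form totals:

#include <assert.h>
#include <stdio.h>

// Check the access-count algebra from the preceding slides numerically.
// X0/X1 stand in for X'0/X'1; the tile sizes are illustrative.
int main(void) {
  const long S0 = 3, S1 = 2, X0 = 4, X1 = 5;
  const long S = S0 * S1, Xp = X0 * X1;   // Xp stands for X'

  long w_reads = 0, i_reads = 0, o_writes = 0;
  for (long x1 = 0; x1 < X1; x1++) {
    for (long s1 = 0; s1 < S1; s1++) {
      w_reads  += X0 * S0;   // every MAC reads a weight
      i_reads  += X0 + S0;   // one read per input in the sliding window
      o_writes += X0;        // output stationary: one write per output
    }
  }

  assert(w_reads  == Xp * S);             // X' * S
  assert(i_reads  == Xp * S1 + X1 * S);   // X'*S1 + X'1*S
  assert(o_writes == Xp * S1);            // X' * S1
  printf("weight reads=%ld, input reads=%ld, output writes=%ld\n",
         w_reads, i_reads, o_writes);
  return 0;
}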
SPATIAL PES

[Figure: two PEs (PE0 and PE1), each with its own L0 Weights, Inputs, and Outputs buffers.]

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

// Level 1: X'1 = 2; S1 = 1 => s1 = 0
parallel-for (x1 = 0; x1 < X'1; x1++) {
  parallel-for (s1 = 0; s1 < S1; s1++) {
    // Level 0
    for (x0 = 0; x0 < X'0; x0++) {
      for (s0 = 0; s0 < S0; s0++) {
        o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
      }
    }
  }
}
1-D CONVOLUTION – SPATIAL

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

// Level 1: S1 = 1, so s1 = 0 below
parallel-for (x1 = 0; x1 < 2; x1++) {
  // Level 0
  for (x0 = 0; x0 < X'0; x0++) {
    for (s0 = 0; s0 < S0; s0++) {
      o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
    }
  }
}
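Reading the parallel-for as spatial unrolling, each x1 value instantiates one PE; a hedged expansion with s1 = 0 substituted makes the output partitioning explicit:

// PE0 (x1 = 0): computes outputs o[0 .. X'0-1]
for (x0 = 0; x0 < X'0; x0++) {
  for (s0 = 0; s0 < S0; s0++) {
    o[x0] += i[x0 + s0] * w[s0];
  }
}

// PE1 (x1 = 1): computes outputs o[X'0 .. 2*X'0-1]
for (x0 = 0; x0 < X'0; x0++) {
  for (s0 = 0; s0 < S0; s0++) {
    o[X'0 + x0] += i[X'0 + x0 + s0] * w[s0];
  }
}

Both PEs read the same weights but write disjoint output ranges (their input windows overlap only at the boundary halo).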
SPATIAL PES

[Figure: PE0 and PE1, each with its own L0 Weights, Inputs, and Outputs buffers. The access trace begins with both PEs fetching the same weight: >>w[0] on PE0 and >>w[0] on PE1.]

// Level 1: S1 = 1, so s1 = 0 below
parallel-for (x1 = 0; x1 < 2; x1++) {
  // Level 0
  for (x0 = 0; x0 < X'0; x0++) {
    for (s0 = 0; s0 < S0; s0++) {
      o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
    }
  }
}
SPATIAL PES: PARTITIONED INPUTS (assuming S = 3)

[Figure: PE0 and PE1, each with its own L0 Weights, Inputs, and Outputs buffers.]

Access trace (>> = read, << = write):

  PE0       PE1
  >>w[0]    >>w[0]
  >>i[0]    >>i[X'0+0]
  >>w[1]    >>w[1]
  >>i[1]    >>i[X'0+1]
  >>w[2]    >>w[2]
  >>i[2]    >>i[X'0+2]
  <<o[0]    <<o[X'0+0]
  >>w[0]    >>w[0]
  >>i[1]    >>i[X'0+1]
  >>w[1]    >>w[1]
  >>i[2]    >>i[X'0+2]
  >>w[2]    >>w[2]
  >>i[3]    >>i[X'0+3]
  <<o[1]    <<o[X'0+1]

Implementation opportunity? Parallel fetch: both PEs read the same weight at each step, so a single fetch can serve both.
SPATIAL PARTITIONING WEIGHTS

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

// Level 1: X'1 = 1 => x1 = 0; S0 = 1, S1 = 2
parallel-for (x1 = 0; x1 < X'1; x1++) {
  parallel-for (s1 = 0; s1 < S1; s1++) {
    // Level 0
    for (x0 = 0; x0 < X'0; x0++) {
      for (s0 = 0; s0 < S0; s0++) {
        o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
      }
    }
  }
}
SPATIAL PARTITIONING WEIGHTS

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

// Level 1: X'1 = 1, so x1 = 0 below
parallel-for (s1 = 0; s1 < 2; s1++) {
  // Level 0
  for (x0 = 0; x0 < X'0; x0++) {
    for (s0 = 0; s0 < S0; s0++) {
      o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
    }
  }
}
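Expanding the parallel-for with x1 = 0 substituted (a hedged expansion mirroring the earlier one) shows why this partitioning differs from partitioning outputs: both PEs accumulate into the same outputs.

// PE0 (s1 = 0): uses weights w[0 .. S0-1]
for (x0 = 0; x0 < X'0; x0++) {
  for (s0 = 0; s0 < S0; s0++) {
    o[x0] += i[x0 + s0] * w[s0];
  }
}

// PE1 (s1 = 1): uses weights w[S0 .. 2*S0-1], same outputs as PE0
for (x0 = 0; x0 < X'0; x0++) {
  for (s0 = 0; s0 < S0; s0++) {
    o[x0] += i[x0 + S0 + s0] * w[S0 + s0];
  }
}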
1-D CONVOLUTION – SPATIAL (assuming S = 3)

// Level 1: X'1 = 1, so x1 = 0 below
parallel-for (s1 = 0; s1 < 2; s1++) {
  // Level 0
  for (x0 = 0; x0 < X'0; x0++) {
    for (s0 = 0; s0 < S0; s0++) {
      o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
    }
  }
}

Access trace:

  PE0       PE1
  >>w[0]    >>w[S0+1]
  >>i[1]    >>i[S0+1]
  >>w[1]    >>w[S0+2]
  >>i[2]    >>i[S0+2]
  >>w[2]    >>w[S0+3]
  >>i[3]    >>i[S0+3]
  <<o[1]    <<o[1]

Spatial sum needed? Yes: both PEs produce partial sums for the same output o[1], which must be combined.
SPATIAL PES WITH SPATIAL SUMMATION

[Figure: PE0 and PE1, each with L0 Weights and Inputs buffers, feed a spatial adder (+) whose result goes to the L0 Outputs buffer. The sum may be embedded in a PE.]

// Level 2
for (x2 = 0; x2 < X'2; x2++) {
  for (s2 = 0; s2 < S2; s2++) {
    // Level 1
    parallel-for (x1 = 0; x1 < X'1; x1++) {
      parallel-for (s1 = 0; s1 < S1; s1++) {
        // Level 0
        for (x0 = 0; x0 < X'0; x0++) {
          for (s0 = 0; s0 < S0; s0++) {
            ...

General approach: alternate temporal and spatial levels, and choose the loop bounds (including 1) carefully.
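For reference, a hedged sketch of the full three-level body (my reconstruction, extending the two-level index arithmetic; the slide elides it):

// Three-level tiled 1-D convolution (sketch).
// Assumes the factorizations X' = X'2*X'1*X'0 and S = S2*S1*S0.
for (x2 = 0; x2 < X'2; x2++) {
  for (s2 = 0; s2 < S2; s2++) {
    parallel-for (x1 = 0; x1 < X'1; x1++) {
      parallel-for (s1 = 0; s1 < S1; s1++) {
        for (x0 = 0; x0 < X'0; x0++) {
          for (s0 = 0; s0 < S0; s0++) {
            x = (x2*X'1 + x1)*X'0 + x0;   // full output index
            s = (s2*S1 + s1)*S0 + s0;     // full weight index
            o[x] += i[x+s] * w[s];
          }
        }
      }
    }
  }
}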
"FRACTAL" ACCELERATOR DESIGN

[Figure: the same buffer structure repeats at every scale. An L1 level with Weights, Inputs, and Outputs buffers feeds multiple PEs, and each PE contains its own L0 Weights, Inputs, and Outputs buffers.]
MORE REALISTIC PROBLEM: 2-D CONVOLUTION

[Figure: Weights (S x R) convolved with Inputs (X x Y) produce Partial Sums (X' x Y'), where X' = X - ceil(S/2) and Y' = Y - ceil(R/2).]
2-D CONVOLUTION LOOP NEST

int i[Y][X];   # Input activations
int w[R][S];   # Filter weights
int o[Y'][X']; # Output activations

[Figure: Inputs (X x Y) and Partial Sums (X' x Y'), with X' = X - ceil(S/2) and Y' = Y - ceil(R/2).]
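A minimal sketch of the corresponding loop nest (my reconstruction, directly extending the 1-D version above):

for (y = 0; y < Y'; y++) {
  for (x = 0; x < X'; x++) {
    for (r = 0; r < R; r++) {
      for (s = 0; s < S; s++) {
        o[y][x] += i[y+r][x+s] * w[r][s];
      }
    }
  }
}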
CNN PROBLEM FORMULATION – INPUT CHANNELS

[Figure: Weights (S x R x C) applied to Inputs (X x Y x C) produce Partial Sums (X' x Y'); contributions are accumulated across the C input channels. X' = X - ceil(S/2), Y' = Y - ceil(R/2).]
CNN PROBLEM FORMULATION – OUTPUT CHANNELS

[Figure: K sets of Weights (each S x R x C) applied to Inputs (X x Y x C) produce K channels of Partial Sums (X' x Y' x K). X' = X - ceil(S/2), Y' = Y - ceil(R/2).]
REFERENCE CONVOLUTIONAL LAYER

int i[C][Y][X];    # Input activation channels
int w[K][C][R][S]; # Filter weights (per channel pair)
int o[K][Y'][X'];  # Output activation channels
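And the reference loop nest, as a sketch consistent with these declarations (a standard formulation, reconstructed rather than copied from the slide):

for (k = 0; k < K; k++) {
  for (y = 0; y < Y'; y++) {
    for (x = 0; x < X'; x++) {
      for (c = 0; c < C; c++) {
        for (r = 0; r < R; r++) {
          for (s = 0; s < S; s++) {
            o[k][y][x] += i[c][y+r][x+s] * w[k][c][r][s];
          }
        }
      }
    }
  }
}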
Understanding CNN dataflow and tiling concepts is critical for computer architecture researchers focusing on domain-specific accelerators.

Caches are generally considered too expensive (area + power), but we can easily make up the difference using workload knowledge (more about this later today).