2_DataflowAnalysis
Many slides from the class of Professors Joel Emer (NVIDIA, MIT) and Vivienne Sze (MIT):
• 6.S082/6.888 Hardware Architectures for Deep Learning (Fall, 2017)
Also includes slides from: Angshuman Parashar, Senior Research Scientist (NVIDIA)
ACCELERATORS ARE GREAT.... BUT!

[Figure: a custom datapath connected to off-chip memory.]
MOTIVATION: DATA MOVEMENT

Energy costs (relative to an 8-bit integer multiply):
  8-bit integer multiply                      1x
  Fetch two 8-bit operands from large SRAM    ~10x
  Fetch two 8-bit operands from DRAM          ~100x

Why it's important: fetching the operands can cost 10-100x the energy of the arithmetic itself, so data movement, not compute, dominates energy.
MAPPING REUSE TO HARDWARE

[Figure: a 7-dimensional network layer (weights: R x S x C x K; inputs: X x Y x C x N; outputs: X' x Y' x K x N) is mapped onto a 2-D hardware array of PEs.]

Algorithmic reuse maps to hardware reuse.
PEDAGOGICAL EXAMPLE: 1-D CONVOLUTION

Weights (size S) * Inputs (size X) = Outputs* (size X' = X - ceil(S/2)†)

int i[X];  # Input activations
int w[S];  # Filter weights
int o[X']; # Output activations

* The terms "Output" and "Partial Sum" are used interchangeably.
† Assuming 'valid'-style convolution.

Note: the "stationary" name is meant to give intuition, not to be a complete specification of all the behavior of a dataflow.

[Plot: input bandwidth (0 to 2 words) over time.]
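For concreteness, a minimal weight-stationary sketch of this example (an illustrative reconstruction, not necessarily the exact code on the slide): holding each weight resident while streaming the inputs it touches keeps weight bandwidth near zero.

// Weight stationary (sketch): w[s] stays resident in the datapath
// for an entire pass over the outputs.
for (s = 0; s < S; s++) {
  for (x = 0; x < X'; x++) {
    o[x] += i[x+s] * w[s];   // partial sums accumulate across s passes
  }
}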
MORE DATAFLOWS

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

int i[X];  # Input activations
int w[S];  # Filter weights
int o[X']; # Output activations

for (x = 0; x < X'; x++) {
  for (s = 0; s < S; s++) {
    o[x] += i[x+s] * w[s];
  }
}

How can we implement input stationary when there is no standalone input index in this loop nest?
INPUT STATIONARY

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

[Figure: a single multiplier with local buffers for Weights, Inputs, and Partial Sums, all backed by a backing store. The single multiplier sequentially computes all the partial sums that use the currently resident input.]
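One answer to the question above, as a minimal sketch (my reconstruction under the 'valid'-convolution assumption): iterate over the input index directly and scatter each resident input's contributions to every output it feeds.

// Input stationary (sketch): each input i[x] is fetched once and held
// while all (weight, partial-sum) pairs that use it are processed.
for (x = 0; x < X; x++) {
  for (s = 0; s < S; s++) {
    xo = x - s;                  // output fed by this (input, weight) pair
    if (xo >= 0 && xo < X') {
      o[xo] += i[x] * w[s];
    }
  }
}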
1-D CONVOLUTION – SUMMARY

Weights (size S) * Inputs (size X) = Outputs (size X')
MULTI-LAYER BUFFERING
L1
L0 L0
Weights Weights
L1
L0 L0 Data
Inputs Inputs path
L0
L1 L0
Outputs Outputs
18
1-D CONVOLUTION – BUFFERED

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

Energy of a buffer access is a function of the size of the buffer, and each buffer level's energy is proportional to the number of accesses at that level.

For level 0, the accesses are all the operands delivered to the datapath. For a level L > 0 there are three components:
- Data arriving from level L+1
- Data that needs to be transferred to level L-1
- Data that is returned from level L-1
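A hedged sketch of this cost model (the level count and per-access energies are illustrative assumptions, loosely following the multiply/SRAM/DRAM ratios above, not values from the slides):

#include <stdio.h>

#define LEVELS 3

// Illustrative per-access energies, normalized to one multiply = 1 unit.
double energy_per_access[LEVELS] = { 1.0, 10.0, 100.0 };  // L0, L1, L2 (DRAM)

// Total energy = sum over levels of
//   (accesses at that level) * (energy per access at that level).
double hierarchy_energy(const long accesses[LEVELS]) {
  double total = 0.0;
  for (int l = 0; l < LEVELS; l++) {
    total += accesses[l] * energy_per_access[l];
  }
  return total;
}

int main(void) {
  long accesses[LEVELS] = { 1000000, 10000, 100 };  // example access counts
  printf("total energy: %.1f units\n", hierarchy_energy(accesses));
  return 0;
}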
MAPPING - WEIGHT ACCESS COSTS

// Level 1
for (x1 = 0; x1 < X'1; x1++) {
  for (s1 = 0; s1 < S1; s1++) {
    // Level 0
    for (x0 = 0; x0 < X'0; x0++) {
      for (s0 = 0; s0 < S0; s0++) {
        o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
      }
    }
  }
}

[Figure: weight accesses over time. Each s1 step (s1=0, s1++, ...) reads the next chunk of S0 weights; when the next x1 iteration begins, the pattern restarts at s1=0.]
MAPPING - WEIGHT ACCESS COSTS

Level 0 reads:
- Per level 1 iteration -> X'0*S0 weight reads
- Times X'1*S1 level 1 iterations
- Total reads = (X'0*S0)*(X'1*S1) = (X'0*X'1)*(S0*S1) = X'*S reads

Level 1 to 0 transfers:
- Per level 1 iteration -> S0 weights transferred
- Times the same number of level 1 iterations, X'1*S1
- Total transfers = S0*(X'1*S1) = X'1*(S0*S1) = X'1*S transfers
[Figure: input accesses over time. The first s1 step reads a window of X'0+S0 inputs; each later s1 step slides the window by S0, so consecutive windows overlap. The overlap between the chunks needed by neighboring iterations is the input halo.]
MAPPING - INPUT ACCESS COSTS

Level 0 reads:
- Per level 1 iteration -> X'0+S0 input reads
- Times X'1*S1 level 1 iterations
- Total reads = X'1*S1*(X'0+S0) = ((X'1*X'0)*S1)+(X'1*(S1*S0)) = X'*S1 + X'1*S reads

Level 1 to 0 transfers:
- For s1 = 0, X'0+S0 inputs transferred
- For each of the following S1-1 iterations, another S0 inputs transferred
- So the total per x1 iteration is X'0+S0*S1 = X'0+S inputs
- Times the number of x1 iterations, X'1
- Total transfers = X'1*(X'0+S) = (X'1*X'0)+X'1*S = X'+X'1*S transfers
[Figure: output accesses over time. Every s1 step accumulates into the same X'0 outputs; a fresh chunk of X'0 outputs begins with each x1 iteration.]
MAPPING - OUTPUT ACCESS COSTS

Level 0 writes:
- Because level 0 is 'output stationary', only X'0 writes per level 1 iteration
- Times X'1*S1 level 1 iterations
- Total writes = X'0*(X'1*S1) = (X'0*X'1)*S1 = X'*S1 writes

Level 0 to 1 transfers:
- After every S1 iterations, the completed partial sums for X'0 outputs are transferred
- There are X'1 chunks of S1 iterations
- Total transfers = X'1*X'0 = X' transfers
MAPPING DATA COST SUMMARY

// Level 1
for (x1 = 0; x1 < X'1; x1++) {
  for (s1 = 0; s1 < S1; s1++) {
    // Level 0
    for (x0 = 0; x0 < X'0; x0++) {
      for (s0 = 0; s0 < S0; s0++) {
        o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
      }
    }
  }
}

Operand         Level 0 accesses   Level 1<->0 transfers
Weight reads    X' * S             X'1 * S
Input reads     X'*S1 + X'1*S      X' + X'1*S
Output writes   X' * S1            X'
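As a sanity check on the algebra (a small program I am adding; the tile sizes are arbitrary assumptions), the per-iteration tallies above can be accumulated numerically and compared against the closed-form totals:

#include <assert.h>
#include <stdio.h>

// Check the access-count algebra from the preceding slides numerically.
// X0/X1 stand in for X'0/X'1; the tile sizes are illustrative.
int main(void) {
  const long S0 = 3, S1 = 2, X0 = 4, X1 = 5;
  const long S = S0 * S1, Xp = X0 * X1;   // Xp stands for X'

  long w_reads = 0, i_reads = 0, o_writes = 0;
  for (long x1 = 0; x1 < X1; x1++) {
    for (long s1 = 0; s1 < S1; s1++) {
      w_reads  += X0 * S0;   // every MAC reads a weight
      i_reads  += X0 + S0;   // one read per input in the sliding window
      o_writes += X0;        // output stationary: one write per output
    }
  }

  assert(w_reads  == Xp * S);             // X' * S
  assert(i_reads  == Xp * S1 + X1 * S);   // X'*S1 + X'1*S
  assert(o_writes == Xp * S1);            // X' * S1
  printf("weight reads=%ld, input reads=%ld, output writes=%ld\n",
         w_reads, i_reads, o_writes);
  return 0;
}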
SPATIAL PES

[Figure: two PEs (PE0 and PE1), each with its own L0 Weights, Inputs, and Outputs buffers.]

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

// Level 1: X'1 = 2; S1 = 1 => s1 = 0
parallel-for (x1 = 0; x1 < X'1; x1++) {
  parallel-for (s1 = 0; s1 < S1; s1++) {
    // Level 0
    for (x0 = 0; x0 < X'0; x0++) {
      for (s0 = 0; s0 < S0; s0++) {
        o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
      }
    }
  }
}
1-D CONVOLUTION – SPATIAL

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

// Level 1: S1 = 1, so s1 = 0 below
parallel-for (x1 = 0; x1 < 2; x1++) {
  // Level 0
  for (x0 = 0; x0 < X'0; x0++) {
    for (s0 = 0; s0 < S0; s0++) {
      o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
    }
  }
}
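Reading the parallel-for as spatial unrolling, each x1 value instantiates one PE; a hedged expansion with s1 = 0 substituted makes the output partitioning explicit:

// PE0 (x1 = 0): computes outputs o[0 .. X'0-1]
for (x0 = 0; x0 < X'0; x0++) {
  for (s0 = 0; s0 < S0; s0++) {
    o[x0] += i[x0 + s0] * w[s0];
  }
}

// PE1 (x1 = 1): computes outputs o[X'0 .. 2*X'0-1]
for (x0 = 0; x0 < X'0; x0++) {
  for (s0 = 0; s0 < S0; s0++) {
    o[X'0 + x0] += i[X'0 + x0 + s0] * w[s0];
  }
}

Both PEs read the same weights but write disjoint output ranges (their input windows overlap only at the boundary halo).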
SPATIAL PES

[Figure: PE0 and PE1, each with its own L0 Weights, Inputs, and Outputs buffers. The access trace begins with both PEs fetching the same weight: >>w[0] on PE0 and >>w[0] on PE1.]

// Level 1: S1 = 1, so s1 = 0 below
parallel-for (x1 = 0; x1 < 2; x1++) {
  // Level 0
  for (x0 = 0; x0 < X'0; x0++) {
    for (s0 = 0; s0 < S0; s0++) {
      o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
    }
  }
}
SPATIAL PES: PARTITIONED INPUTS (assuming S = 3)

[Figure: PE0 and PE1, each with its own L0 Weights, Inputs, and Outputs buffers.]

Access trace (>> = read, << = write):

  PE0       PE1
  >>w[0]    >>w[0]
  >>i[0]    >>i[X'0+0]
  >>w[1]    >>w[1]
  >>i[1]    >>i[X'0+1]
  >>w[2]    >>w[2]
  >>i[2]    >>i[X'0+2]
  <<o[0]    <<o[X'0+0]
  >>w[0]    >>w[0]
  >>i[1]    >>i[X'0+1]
  >>w[1]    >>w[1]
  >>i[2]    >>i[X'0+2]
  >>w[2]    >>w[2]
  >>i[3]    >>i[X'0+3]
  <<o[1]    <<o[X'0+1]

Implementation opportunity? Parallel fetch: both PEs read the same weight at each step, so a single fetch can serve both.
SPATIAL PARTITIONING WEIGHTS

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

// Level 1: X'1 = 1 => x1 = 0; S0 = 1, S1 = 2
parallel-for (x1 = 0; x1 < X'1; x1++) {
  parallel-for (s1 = 0; s1 < S1; s1++) {
    // Level 0
    for (x0 = 0; x0 < X'0; x0++) {
      for (s0 = 0; s0 < S0; s0++) {
        o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
      }
    }
  }
}
SPATIAL PARTITIONING WEIGHTS

Weights (size S) * Inputs (size X) = Outputs (size X' = X - ceil(S/2))

// Level 1: X'1 = 1, so x1 = 0 below
parallel-for (s1 = 0; s1 < 2; s1++) {
  // Level 0
  for (x0 = 0; x0 < X'0; x0++) {
    for (s0 = 0; s0 < S0; s0++) {
      o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
    }
  }
}
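Expanding the parallel-for with x1 = 0 substituted (a hedged expansion mirroring the earlier one) shows why this partitioning differs from partitioning outputs: both PEs accumulate into the same outputs.

// PE0 (s1 = 0): uses weights w[0 .. S0-1]
for (x0 = 0; x0 < X'0; x0++) {
  for (s0 = 0; s0 < S0; s0++) {
    o[x0] += i[x0 + s0] * w[s0];
  }
}

// PE1 (s1 = 1): uses weights w[S0 .. 2*S0-1], same outputs as PE0
for (x0 = 0; x0 < X'0; x0++) {
  for (s0 = 0; s0 < S0; s0++) {
    o[x0] += i[x0 + S0 + s0] * w[S0 + s0];
  }
}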
1-D CONVOLUTION – SPATIAL (assuming S = 3)

// Level 1: X'1 = 1, so x1 = 0 below
parallel-for (s1 = 0; s1 < 2; s1++) {
  // Level 0
  for (x0 = 0; x0 < X'0; x0++) {
    for (s0 = 0; s0 < S0; s0++) {
      o[x1*X'0+x0] += i[x1*X'0+x0 + s1*S0+s0] * w[s1*S0+s0];
    }
  }
}

Access trace:

  PE0       PE1
  >>w[0]    >>w[S0+1]
  >>i[1]    >>i[S0+1]
  >>w[1]    >>w[S0+2]
  >>i[2]    >>i[S0+2]
  >>w[2]    >>w[S0+3]
  >>i[3]    >>i[S0+3]
  <<o[1]    <<o[1]

Spatial sum needed? Yes: both PEs produce partial sums for the same output o[1], which must be combined.
SPATIAL PES WITH SPATIAL SUMMATION

[Figure: PE0 and PE1, each with L0 Weights and Inputs buffers, feed a spatial adder (+) whose result goes to the L0 Outputs buffer. The sum may be embedded in a PE.]

// Level 2
for (x2 = 0; x2 < X'2; x2++) {
  for (s2 = 0; s2 < S2; s2++) {
    // Level 1
    parallel-for (x1 = 0; x1 < X'1; x1++) {
      parallel-for (s1 = 0; s1 < S1; s1++) {
        // Level 0
        for (x0 = 0; x0 < X'0; x0++) {
          for (s0 = 0; s0 < S0; s0++) {
            ...

General approach: alternate temporal and spatial levels, and choose the loop bounds (including 1) carefully.
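For reference, a hedged sketch of the full three-level body (my reconstruction, extending the two-level index arithmetic; the slide elides it):

// Three-level tiled 1-D convolution (sketch).
// Assumes the factorizations X' = X'2*X'1*X'0 and S = S2*S1*S0.
for (x2 = 0; x2 < X'2; x2++) {
  for (s2 = 0; s2 < S2; s2++) {
    parallel-for (x1 = 0; x1 < X'1; x1++) {
      parallel-for (s1 = 0; s1 < S1; s1++) {
        for (x0 = 0; x0 < X'0; x0++) {
          for (s0 = 0; s0 < S0; s0++) {
            x = (x2*X'1 + x1)*X'0 + x0;   // full output index
            s = (s2*S1 + s1)*S0 + s0;     // full weight index
            o[x] += i[x+s] * w[s];
          }
        }
      }
    }
  }
}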
"FRACTAL" ACCELERATOR DESIGN

[Figure: the same buffer structure repeats at every scale. An L1 level with Weights, Inputs, and Outputs buffers feeds multiple PEs, and each PE contains its own L0 Weights, Inputs, and Outputs buffers.]
MORE REALISTIC PROBLEM: 2-D CONVOLUTION

[Figure: Weights (S x R) convolved with Inputs (X x Y) produce Partial Sums (X' x Y'), where X' = X - ceil(S/2) and Y' = Y - ceil(R/2).]
2-D CONVOLUTION LOOP NEST

int i[Y][X];   # Input activations
int w[R][S];   # Filter weights
int o[Y'][X']; # Output activations

[Figure: Inputs (X x Y) and Partial Sums (X' x Y'), with X' = X - ceil(S/2) and Y' = Y - ceil(R/2).]
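A minimal sketch of the corresponding loop nest (my reconstruction, directly extending the 1-D version above):

for (y = 0; y < Y'; y++) {
  for (x = 0; x < X'; x++) {
    for (r = 0; r < R; r++) {
      for (s = 0; s < S; s++) {
        o[y][x] += i[y+r][x+s] * w[r][s];
      }
    }
  }
}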
CNN PROBLEM FORMULATION – INPUT CHANNELS

[Figure: Weights (S x R x C) applied to Inputs (X x Y x C) produce Partial Sums (X' x Y'); contributions are accumulated across the C input channels. X' = X - ceil(S/2), Y' = Y - ceil(R/2).]
CNN PROBLEM FORMULATION – OUTPUT CHANNELS

[Figure: K sets of Weights (each S x R x C) applied to Inputs (X x Y x C) produce K channels of Partial Sums (X' x Y' x K). X' = X - ceil(S/2), Y' = Y - ceil(R/2).]
REFERENCE CONVOLUTIONAL LAYER

int i[C][Y][X];    # Input activation channels
int w[K][C][R][S]; # Filter weights (per channel pair)
int o[K][Y'][X'];  # Output activation channels
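And the reference loop nest, as a sketch consistent with these declarations (a standard formulation, reconstructed rather than copied from the slide):

for (k = 0; k < K; k++) {
  for (y = 0; y < Y'; y++) {
    for (x = 0; x < X'; x++) {
      for (c = 0; c < C; c++) {
        for (r = 0; r < R; r++) {
          for (s = 0; s < S; s++) {
            o[k][y][x] += i[c][y+r][x+s] * w[k][c][r][s];
          }
        }
      }
    }
  }
}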
Understanding CNN dataflow and tiling concepts is critical for computer architecture researchers focusing on domain-specific accelerators.

Caches are generally considered too expensive (area + power), but we can easily make up the difference using workload knowledge (more about this later today).