1
DNN Model and
Hardware Co-Design
ISCA Tutorial (2019)
Website: http://eyeriss.mit.edu/tutorial.html
Joel Emer, Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang
2
Reduce Number of Ops and Weights
• Exploit Activation Statistics
• Exploit Weight Statistics
• Exploit Dot Product Computation
• Decomposed Trained Filters
• Knowledge Distillation
3
Sparsity in Fmaps
Many zeros in output fmaps after ReLU. Example (3×3 fmap):

   9 -1 -3            9 0 0
   1 -5  5   ReLU →   1 0 5
  -2  6 -1            0 6 0

[Chart: # of activations vs. # of non-zero activations (normalized) for CONV layers 1–5]
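As a minimal sketch of how that non-zero fraction might be measured (using NumPy and a random fmap as a stand-in for a real layer output):

```python
import numpy as np

def relu(x):
    # ReLU clamps negative values to zero, which is the source of fmap sparsity
    return np.maximum(x, 0)

# Hypothetical stand-in for one CONV layer's output fmap (values centered at 0)
fmap = np.random.randn(64, 32, 32)          # channels x height x width
activations = relu(fmap)

density = np.count_nonzero(activations) / activations.size
print(f"non-zero fraction after ReLU: {density:.2f}")   # ~0.5 for zero-mean inputs
```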
4
I/O Compression in Eyeriss
[Diagram: Eyeriss DCNN accelerator — 14×12 PE array, 108KB buffer SRAM, ReLU unit, and compression (Comp) / decompression (Decomp) blocks on the 64-bit off-chip DRAM interface; separate link and core clock domains; filters (Filt), images (Img), and psums move between the buffer and the PE array]

Run-Length Compression (RLC)
Example:
  Input: 0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22, …
  Output (64b):  Run (5b) | Level (16b) | Run (5b) | Level (16b) | Run (5b) | Level (16b) | Term (1b)
                    2     |     12      |    4     |     53      |    2     |     22      |    0

[Chen et al., ISSCC 2016]
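To make the format concrete, here is a hedged sketch of an RLC encoder for the layout shown above (up to three 5-bit zero-run / 16-bit level pairs plus a 1-bit term flag per 64-bit word). The meaning of the term flag, the handling of long zero runs, and partial-word padding are not specified on the slide, so those parts are assumptions.

```python
def rlc_encode(values, max_run=31, pairs_per_word=3):
    """Sketch of Eyeriss-style run-length coding: each 64-bit word holds up to
    three (5-bit zero-run, 16-bit level) pairs plus a 1-bit term flag."""
    pairs = []
    run = 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            pairs.append((run, v & 0xFFFF))  # (count of preceding zeros, level); run>31 emits a zero level
            run = 0
    words = []
    for i in range(0, len(pairs), pairs_per_word):
        word = 0
        for zrun, level in pairs[i:i + pairs_per_word]:
            word = (word << 21) | (zrun << 16) | level
        term = 1 if i + pairs_per_word >= len(pairs) else 0  # assumed meaning of 'Term'
        words.append((word << 1) | term)
    # Trailing zeros and padding of partially filled words are omitted for brevity.
    return words

# The slide's example: 0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22 -> pairs (2,12), (4,53), (2,22)
print([hex(w) for w in rlc_encode([0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22])])
```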
5
Compression Reduces DRAM BW
[Chart: DRAM access (MB) for AlexNet CONV layers 1–5, uncompressed fmaps + weights vs. RLE-compressed fmaps + weights; compression reduces DRAM access by 1.2×–1.9× across the five layers (1.2×, 1.4×, 1.7×, 1.8×, 1.9×)]
[Chen et al., ISSCC 2016]
Simple RLC is within 5%–10% of the theoretical entropy limit.
6
Data Gating / Zero Skipping in Eyeriss
[Diagram: Eyeriss PE datapath — filter scratch pad (225×16b SRAM), image scratch pad (12×16b REG), and partial-sum scratch pad (24×16b REG) feeding a 2-stage pipelined multiplier and an accumulator for the input/output psums; a zero check (== 0) on the image data sets a flag in the zero buffer that gates the scratch-pad read enables and resets the multiplier inputs]

Skip MAC and mem reads when image data is zero. Reduces PE power by 45%.

[Chen et al., ISSCC 2016]
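A behavioral sketch of the data-gating idea (not the Eyeriss RTL): when the image operand is zero, both the filter scratch-pad read and the multiply-accumulate are skipped.

```python
def gated_mac(image_vals, filter_vals):
    """Behavioral model of zero gating: when the image operand is zero, the
    filter read and the multiply-accumulate are skipped entirely."""
    psum = 0
    skipped = 0
    for i, img in enumerate(image_vals):
        if img == 0:
            skipped += 1          # no filter scratch-pad read, no MAC -> switching energy saved
            continue
        w = filter_vals[i]        # filter scratch-pad read happens only for non-zero data
        psum += img * w
    return psum, skipped

psum, skipped = gated_mac([9, 0, 0, 1, 0, 5, 0, 6, 0], [1, 2, 3, 4, 5, 6, 7, 8, 9])
print(psum, skipped)  # skipped MACs correspond directly to gated operations
```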
7
Cnvlutin
• Processes convolution layers
• Built on top of DaDianNao (4.49% area overhead)
• Speedup of 1.37× (1.52× with activation pruning)
[Albericio et al., ISCA 2016]
8
Pruning Activations
Remove small activation values.
Minerva [Reagen et al., ISCA 2016]: reduces power by 2× (MNIST)
Cnvlutin [Albericio et al., ISCA 2016]: speeds up by 11% (ImageNet)
9
Exploit Correlation in Input Data
• Exploit Temporal Correlation of Inputs
  – Reduce the amount of computation if there is temporal correlation between frames
  – Requires additional storage and a way to measure redundancy (e.g., motion vectors for video)
  – Application specific (e.g., video) – requires that the same operation is done for each frame (not always the case)
[Zhang et al., FAST, CVPRW 2017], [EVA2, ISCA 2018],
[Euphrates, ISCA 2018], [Riera et al., ISCA 2018]
10
Exploit Correlation in Input Data
• Exploit Spatial Correlation of Inputs
  – Delta code neighboring values (activations), resulting in sparse inputs to each layer (see the sketch below)
  – Reduces storage cost and data movement, improving energy efficiency and throughput
[Diffy, MICRO 2018]
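A small sketch of the idea (using NumPy, with a made-up smooth activation row): delta coding neighboring activations turns spatially correlated values into mostly zeros and small deltas.

```python
import numpy as np

def delta_code(row):
    # Store the first value and the differences between horizontal neighbors.
    deltas = np.empty_like(row)
    deltas[0] = row[0]
    deltas[1:] = row[1:] - row[:-1]
    return deltas

def delta_decode(deltas):
    return np.cumsum(deltas)

row = np.array([50, 50, 51, 51, 51, 52, 52, 52, 52, 53])   # hypothetical smooth activations
deltas = delta_code(row)
print(deltas)                       # mostly 0s and 1s -> sparse / highly compressible
assert np.array_equal(delta_decode(deltas), row)
```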
11
Pruning – Make Weights Sparse
• Optimal Brain Damage [Lecun et al., NeurIPS 1989]
  1. Choose a reasonable network architecture
  2. Train network until a reasonable solution is obtained
  3. Compute the second derivative for each weight
  4. Compute saliencies (i.e., impact on training error) for each weight
  5. Sort weights by saliency and delete low-saliency weights
  6. Iterate from step 2 (retraining)
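As a sketch of steps 3–5: OBD's saliency for weight k is commonly written as s_k = ½·h_kk·w_k², where h_kk is the diagonal Hessian term. Assuming the h_kk values have already been computed (step 3), the prune step reduces to a sort:

```python
import numpy as np

def obd_prune(weights, hessian_diag, prune_fraction=0.5):
    """Sketch of OBD steps 4-5: saliency s_k = 0.5 * h_kk * w_k^2,
    then delete (zero out) the lowest-saliency weights."""
    saliency = 0.5 * hessian_diag * weights ** 2
    n_prune = int(prune_fraction * weights.size)
    prune_idx = np.argsort(saliency)[:n_prune]   # least impact on training error
    pruned = weights.copy()
    pruned[prune_idx] = 0.0
    return pruned

w = np.random.randn(1000)
h = np.abs(np.random.randn(1000))    # stand-in for the (assumed precomputed) diagonal Hessian
w_pruned = obd_prune(w, h, prune_fraction=0.5)
# Step 6: retrain the remaining weights and repeat.
```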
12
Pruning – Make Weights Sparse
Prune based on magnitude of weights: Train Connectivity → Prune Connections → Train Weights
[Figure: before pruning vs. after pruning — pruning synapses and pruning neurons]
Example: AlexNet
  Weight reduction: CONV layers 2.7×, FC layers 9.9× (most reduction in the fully connected layers)
  Overall: 9× weight reduction, 3× MAC reduction
[Han et al., NeurIPS 2015]
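A hedged sketch of the train–prune–retrain loop (thresholding by weight magnitude; the `train` callable and the gradual sparsity schedule are placeholders, not the authors' exact recipe):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Prune connections whose magnitude falls below the sparsity-th percentile."""
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Sketch of the Train Connectivity -> Prune Connections -> Train Weights loop.
# `train(weights, mask)` is a placeholder for retraining only the surviving weights.
def iterative_prune(weights, train, target_sparsity=0.9, steps=3):
    mask = np.ones_like(weights, dtype=bool)
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps      # prune gradually
        weights, mask = magnitude_prune(weights, sparsity)
        weights = train(weights, mask)                 # retrain with pruned connections held at 0
    return weights, mask
```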
13
Speed up of Weight Pruning on CPU/GPU
Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV
NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV
NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV
Batch size = 1
On Fully Connected Layers Only
Average speedup of 3.2× on GPU, 3× on CPU, 5× on mGPU
[Han et al., NeurIPS 2015]
14
Design of Efficient DNN Algorithms
• Popular efficient DNN algorithm approaches
[Figure: Network Pruning (before vs. after pruning — pruning synapses and pruning neurons) and Compact Network Architectures (replacing an R×S×C filter with smaller 1×1 and R×S filters); examples: SqueezeNet, MobileNet]
... also reduced precision
• Focus on reducing number of MACs and weights
• Does it translate to energy savings and reduced latency?
15
Energy-Evaluation Methodology
[Flow: the CNN shape configuration (# of channels, # of filters, etc.) and the CNN weights and input data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]) feed a memory-access optimization and a # of MACs calculation; the resulting # of accesses at each memory level (1…n) and # of MACs are weighted by the hardware energy cost of each MAC and memory access to yield Ecomp and Edata, which combine into the per-layer CNN energy consumption (L1, L2, L3, …)]
Evaluation tool available at http://eyeriss.mit.edu/energy.html
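The methodology reduces to a weighted sum, E = Ecomp + Edata = (#MACs)·e_MAC + Σ_i (#accesses at level i)·e_i. A minimal sketch (the per-access energy costs and access counts below are placeholders, not the tool's calibrated numbers):

```python
def dnn_energy(num_macs, accesses_per_level, e_mac, e_access_per_level):
    """E = Ecomp + Edata:
       Ecomp = (# of MACs) x (energy per MAC)
       Edata = sum over memory levels of (# accesses at level i) x (energy per access at level i)."""
    e_comp = num_macs * e_mac
    e_data = sum(n * e for n, e in zip(accesses_per_level, e_access_per_level))
    return e_comp + e_data

# Hypothetical numbers for one layer: MACs plus accesses at RF / buffer / DRAM.
energy = dnn_energy(
    num_macs=105_415_200,
    accesses_per_level=[3.0e8, 5.0e7, 4.0e6],
    e_mac=1.0,                              # normalized to one MAC
    e_access_per_level=[1.0, 6.0, 200.0],   # placeholder relative access costs
)
print(f"estimated energy (in MAC-equivalents): {energy:.3e}")
```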
16
Key Observations
• Number of weights alone is not a good metric for energy
• All data types should be considered
[Pie chart: energy consumption breakdown of GoogLeNet — output feature map 43%, input feature map 25%, weights 22%, computation 10%]
[Yang et al., CVPR 2017]
17
Energy-Aware Pruning
Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings.
• Sort layers based on energy and prune the layers that consume the most energy first
• EAP reduces AlexNet energy by 3.7× and outperforms the previous work that uses magnitude-based pruning by 1.7×
[Chart: normalized AlexNet energy (×10^9) — original (Ori.) vs. magnitude-based pruning (DC), a 2.1× reduction, vs. energy-aware pruning (EAP), a 3.7× reduction]
Pruned models available at
http://eyeriss.mit.edu/energy.html
[Yang et al., CVPR 2017]
18
# of Operations vs. Latency
• # of operations (MACs) does not approximate latency well
Source: Google (https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html)
19
NetAdapt: Platform-Aware DNN Adaptation
• Automatically adapt a DNN to a mobile platform to reach a target latency or energy budget
• Use empirical measurements to guide the optimization (avoids modeling of the tool chain or platform architecture)
[Yang et al., ECCV 2018]
[Figure: NetAdapt takes a pretrained network and a budget (e.g., latency 3.8, energy 10.5), generates network proposals (A…Z), obtains empirical measurements for each proposal on the target platform (e.g., latency 15.6…14.3, energy 41…46), and iterates until it produces an adapted network that meets the budget]
Code to be released at http://netadapt.mit.edu
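A simplified pseudocode sketch of the adaptation loop described above (the proposal generation, empirical measurement, and fine-tuning callables are placeholders, and the selection rule is simplified relative to the paper):

```python
def netadapt(pretrained_net, latency_budget, generate_proposals, measure, finetune):
    """Sketch of the NetAdapt loop: propose simplified networks, measure each one
    empirically on the target platform, keep the best, and repeat until the
    latency budget is met. All callables are placeholders."""
    net = pretrained_net
    current = measure(net)                       # empirical measurement, not an analytical model
    while current["latency"] > latency_budget:
        proposals = generate_proposals(net)      # e.g., one per layer, each slightly simplified
        measured = [(p, measure(p)) for p in proposals]
        # among proposals that reduce latency, keep the one with the highest accuracy
        # (assumes at least one proposal reduces latency)
        candidates = [(p, m) for p, m in measured if m["latency"] < current["latency"]]
        net, current = max(candidates, key=lambda pm: pm[1]["accuracy"])
        net = finetune(net)                      # short fine-tune after each adaptation step
    return net
```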
20
Improved Latency vs. Accuracy Tradeoff
• NetAdapt boosts the real inference speed of MobileNet
by up to 1.7x with higher accuracy
[Plot: latency vs. accuracy — NetAdapt points are +0.3% accuracy at 1.7× faster and +0.3% accuracy at 1.6× faster than the reference networks]
Reference:
MobileNet: Howard et al, “Mobilenets: Efficient convolutional neural networks for mobile vision applications”, arXiv 2017
MorphNet: Gordon et al., “Morphnet: Fast & simple resource-constrained structure learning of deep networks”, CVPR 2018
*Tested on the ImageNet dataset and a Google Pixel 1 CPU
[Yang et al., ECCV 2018]
21
Compression of Weights & Activations
• Compress weights and activations between DRAM and the accelerator
• Variable Length / Huffman Coding
• Tested on AlexNet → 2× overall BW reduction
[Moons et al., VLSI 2016; Han et al., ICLR 2016]
Example:
  Value: 16'b0 → Compressed Code: {1'b0}
  Value: 16'bx → Compressed Code: {1'b1, 16'bx}
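A bit-level sketch of the zero-flag scheme in the example above (assuming 16-bit values; the variable-length/Huffman coding of the non-zero values mentioned in the bullets is omitted here):

```python
def compress(values):
    """Sketch of the zero-flag scheme: a zero 16-bit value becomes a single '0' bit;
    any other value becomes a '1' bit followed by its 16 bits."""
    bits = []
    for v in values:
        if v == 0:
            bits.append("0")
        else:
            bits.append("1" + format(v & 0xFFFF, "016b"))
    return "".join(bits)

vals = [0, 0, 12, 0, 53, 0, 0, 0, 22]
stream = compress(vals)
print(len(stream), "bits vs.", 16 * len(vals), "uncompressed")   # 57 vs. 144 bits here
```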
22
Compression Overhead
Index (non-zero position info – e.g., IA and JA for CSR) accounts for approximately half of the storage for fine-grained pruning
[Han et al., ICLR 2016]
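A back-of-the-envelope check of that claim (the matrix size, density, and the 16-bit value/index widths are assumptions for illustration):

```python
# CSR storage for a fine-grained-pruned FC layer
rows, cols, density = 4096, 4096, 0.1
nnz = int(rows * cols * density)

value_bits = nnz * 16                 # non-zero weight values (assumed 16-bit)
ja_bits = nnz * 16                    # JA: one column index per non-zero (assumed 16-bit)
ia_bits = (rows + 1) * 32             # IA: row pointers
index_bits = ja_bits + ia_bits

print(index_bits / (value_bits + index_bits))   # ~0.5: index is about half the storage
```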
23
Coarse-Grained Pruning
[Figure: M filters of size R×S×C applied to an H×W×C input fmap produce an E×F output fmap with M output channels]
May prune by eliminating entire filter planes, or extremely sparse input activation planes, or just a tile of either.
24
Structured/Coarse-Grained Pruning
• Scalpel
  – Prune to match the underlying data-parallel hardware organization for speedup
[Yu et al., ISCA 2017]
[Figure: dense weights vs. sparse weights — example: 2-way SIMD]
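A hedged sketch of SIMD-aware weight pruning in the spirit of Scalpel: weights are kept or removed in groups matching the SIMD width, so the surviving non-zeros stay packed for the SIMD lanes (the group-importance score used here is a simplification):

```python
import numpy as np

def simd_aware_prune(weights, simd_width=2, keep_fraction=0.5):
    """Sketch of SIMD-aware pruning: weights are kept or dropped in groups of
    `simd_width`, so surviving non-zeros stay aligned to the SIMD lanes."""
    flat = weights.reshape(-1, simd_width)                  # assumes size divisible by simd_width
    group_score = np.abs(flat).max(axis=1)                  # a group survives if any weight is large
    n_keep = int(keep_fraction * flat.shape[0])
    keep = np.argsort(group_score)[-n_keep:]
    mask = np.zeros_like(flat)
    mask[keep] = 1.0                                        # keep or drop whole SIMD groups
    return (flat * mask).reshape(weights.shape)

w = np.random.randn(8, 8)
print(simd_aware_prune(w, simd_width=2, keep_fraction=0.5))
```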
25
Exploit Redundant Weights
• Preprocess to reorder weights (OK since the weights are known)
• Perform additions before multiplications to reduce the number of multiplies and reads of weights
• Example: Input = [1 2 3] and filter = [A B A]
  Typical processing: Output = A*1 + B*2 + A*3 → 3 multiplies and 3 weight reads
  If reordered as [A A B]: Output = A*(1+3) + B*2 → 2 multiplies and 2 weight reads
  Note: the bitwidth of the multiplication may need to increase
[UCNN, ISCA 2018]
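A sketch that generalizes the example above: group inputs by their (repeated) weight value, add within each group first, then multiply once per distinct weight. The concrete values of A and B are made up for the check at the end.

```python
from collections import defaultdict

def factored_dot(inputs, weights):
    """Group inputs that share the same weight value, add them first, then
    multiply once per distinct weight (as in the slide's [A B A] example)."""
    groups = defaultdict(float)          # weight value -> running sum of its inputs
    for x, w in zip(inputs, weights):
        groups[w] += x                   # additions happen before any multiply
    return sum(w * s for w, s in groups.items())   # one multiply (and weight read) per distinct weight

# Slide example: Input = [1, 2, 3], filter = [A, B, A] with A = 4, B = 7 (values assumed)
A, B = 4, 7
assert factored_dot([1, 2, 3], [A, B, A]) == A * (1 + 3) + B * 2 == A * 1 + B * 2 + A * 3
```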
26
Exploit ReLU
• Reduce the number of operations when the resulting activation will be negative, since ReLU will set it to zero
• Need to either preprocess (e.g., sort the weights) or minimize the prediction overhead and error
[PredictiveNet, ISCAS 2017], [SnaPEA, ISCA 2018], [Song et al., ISCA 2018]
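A sketch in the spirit of these works (using the weight-sorting preprocessing mentioned above): since post-ReLU inputs are non-negative, once only non-positive weights remain and the partial sum is already ≤ 0, the output must be clipped to zero, so the remaining MACs can be skipped.

```python
def relu_dot_early_exit(inputs, weights):
    """Sketch: inputs are assumed non-negative (post-ReLU). Process positive
    weights first; once only non-positive weights remain and the partial sum is
    already <= 0, the remaining MACs cannot make it positive, so ReLU(out) = 0."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    psum = 0.0
    for n, i in enumerate(order):
        if weights[i] <= 0 and psum <= 0:
            return 0.0, n                 # early exit: skipped len(weights) - n MACs
        psum += inputs[i] * weights[i]
    return max(psum, 0.0), len(order)

out, macs = relu_dot_early_exit([1.0, 2.0, 0.5, 3.0], [-0.2, 0.1, -0.5, -1.0])
print(out, "MACs performed:", macs)       # 0.0 with only 2 of 4 MACs performed
```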
27
Compact Network Architectures
• Break large convolutional layers into a series of smaller convolutional layers
  – Fewer weights, but the same effective receptive field (see the example below)
• Before Training: Network Architecture Design
(already discussed this morning; e.g., MobileNet)
• After Training: Decompose Trained Filters
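A worked example of the receptive-field argument referenced above (standard arithmetic, not tied to any particular paper): two stacked 3×3 convolutions see the same 5×5 window as a single 5×5 convolution, with fewer weights.

```python
def conv_weights(k, channels):
    # weights of a k x k convolution with `channels` input and output channels (bias ignored)
    return k * k * channels * channels

C = 64
single_5x5 = conv_weights(5, C)          # one 5x5 layer: 25 * C^2 weights
stacked_3x3 = 2 * conv_weights(3, C)     # two 3x3 layers: 18 * C^2 weights, same 5x5 receptive field
print(single_5x5, stacked_3x3, single_5x5 / stacked_3x3)   # ~1.4x fewer weights
```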
28
Decompose Trained Filters
After training, perform a low-rank approximation by applying tensor decomposition to the weight kernel; then fine-tune the weights to recover accuracy
[Lebedev et al., ICLR 2015]
R = canonical rank
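The CP decomposition itself takes more machinery; as a simpler illustration of the same low-rank idea (a truncated SVD of the reshaped filter bank rather than the paper's CP decomposition), with random weights standing in for a trained layer:

```python
import numpy as np

def low_rank_filters(W, rank):
    """Illustration of low-rank filter approximation via truncated SVD.
    W has shape (M, C, R, S): M filters of size C x R x S."""
    M, C, R, S = W.shape
    mat = W.reshape(M, C * R * S)                 # one filter per row
    U, s, Vt = np.linalg.svd(mat, full_matrices=False)
    # Two smaller factors (M x rank) and (rank x C*R*S) replace the original M x C*R*S matrix
    A = U[:, :rank] * s[:rank]
    B = Vt[:rank, :]
    approx = (A @ B).reshape(M, C, R, S)
    return A, B, approx

W = np.random.randn(64, 32, 3, 3)
A, B, W_approx = low_rank_filters(W, rank=16)
print(W.size, "->", A.size + B.size, "weights; rel. error:",
      np.linalg.norm(W - W_approx) / np.linalg.norm(W))
# Fine-tuning the factors would follow to recover accuracy.
```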
29
Decompose Trained Filters
[Denton et al., NeurIPS 2014]
• Speed up by 1.6 – 2.7x on CPU/GPU for CONV1,
CONV2 layers
• Reduce size by 5 - 13x for FC layer
• < 1% drop in accuracy
[Figure: visualization of the original vs. approximated filters]
30
Decompose Trained Filters on Phone
[Kim et al., ICLR 2016]
Tucker Decomposition
31
Knowledge Distillation
[Bucilu et al., KDD 2006],[Hinton et al., arXiv 2015]
[Figure: complex teacher DNNs A and B each produce class probabilities through their softmax layers; a simple student DNN is trained so that its softmax scores try to match the teachers' class probabilities]
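A minimal sketch of the "try to match" objective, following the temperature-softened softmax of Hinton et al. (the temperature value and the example logits are arbitrary):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the teacher's softened class probabilities and the
    student's softened softmax ('try to match'). T is the distillation temperature."""
    p_teacher = softmax(teacher_logits, T)       # soft targets from the (possibly ensembled) teachers
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean()

teacher = np.array([[10.0, 2.0, 1.0]])           # hypothetical teacher scores
student = np.array([[6.0, 3.0, 2.0]])            # student scores to be trained toward the teacher
print(distillation_loss(student, teacher))
```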