1
DNN Model and
Hardware Co-Design
ISCA Tutorial (2019)
Website: http://eyeriss.mit.edu/tutorial.html
Joel Emer, Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang
2
Reduce Number of Ops and Weights
• Exploit Activation Statistics
• Exploit Weight Statistics
• Exploit Dot Product Computation
• Decomposed Trained Filters
• Knowledge Distillation
3
Sparsity in Fmaps
Many zeros in output fmaps after ReLU. Example (3×3 fmap):

   9 -1 -3            9 0 0
   1 -5  5   ReLU →   1 0 5
  -2  6 -1            0 6 0

[Chart: # of activations vs. # of non-zero activations (normalized) for CONV layers 1–5]
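As a minimal sketch of how that non-zero fraction might be measured (using NumPy and a random fmap as a stand-in for a real layer output):

```python
import numpy as np

def relu(x):
    # ReLU clamps negative values to zero, which is the source of fmap sparsity
    return np.maximum(x, 0)

# Hypothetical stand-in for one CONV layer's output fmap (values centered at 0)
fmap = np.random.randn(64, 32, 32)          # channels x height x width
activations = relu(fmap)

density = np.count_nonzero(activations) / activations.size
print(f"non-zero fraction after ReLU: {density:.2f}")   # ~0.5 for zero-mean inputs
```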
4
I/O Compression in Eyeriss
[Diagram: Eyeriss DCNN accelerator — 14×12 PE array, 108KB buffer SRAM, ReLU unit, and compression (Comp) / decompression (Decomp) blocks on the 64-bit off-chip DRAM interface; separate link and core clock domains; filters (Filt), images (Img), and psums move between the buffer and the PE array]

Run-Length Compression (RLC)
Example:
  Input: 0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22, …
  Output (64b):  Run (5b) | Level (16b) | Run (5b) | Level (16b) | Run (5b) | Level (16b) | Term (1b)
                    2     |     12      |    4     |     53      |    2     |     22      |    0

[Chen et al., ISSCC 2016]
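To make the format concrete, here is a hedged sketch of an RLC encoder for the layout shown above (up to three 5-bit zero-run / 16-bit level pairs plus a 1-bit term flag per 64-bit word). The meaning of the term flag, the handling of long zero runs, and partial-word padding are not specified on the slide, so those parts are assumptions.

```python
def rlc_encode(values, max_run=31, pairs_per_word=3):
    """Sketch of Eyeriss-style run-length coding: each 64-bit word holds up to
    three (5-bit zero-run, 16-bit level) pairs plus a 1-bit term flag."""
    pairs = []
    run = 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            pairs.append((run, v & 0xFFFF))  # (count of preceding zeros, level); run>31 emits a zero level
            run = 0
    words = []
    for i in range(0, len(pairs), pairs_per_word):
        word = 0
        for zrun, level in pairs[i:i + pairs_per_word]:
            word = (word << 21) | (zrun << 16) | level
        term = 1 if i + pairs_per_word >= len(pairs) else 0  # assumed meaning of 'Term'
        words.append((word << 1) | term)
    # Trailing zeros and padding of partially filled words are omitted for brevity.
    return words

# The slide's example: 0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22 -> pairs (2,12), (4,53), (2,22)
print([hex(w) for w in rlc_encode([0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22])])
```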
5
Compression Reduces DRAM BW
[Chart: DRAM access (MB) for AlexNet CONV layers 1–5, uncompressed fmaps + weights vs. RLE-compressed fmaps + weights; compression reduces DRAM access by 1.2×–1.9× across the five layers (1.2×, 1.4×, 1.7×, 1.8×, 1.9×)]
[Chen et al., ISSCC 2016]
Simple RLC is within 5%–10% of the theoretical entropy limit.
6
Data Gating / Zero Skipping in Eyeriss
[Diagram: Eyeriss PE datapath — filter scratch pad (225×16b SRAM), image scratch pad (12×16b REG), and partial-sum scratch pad (24×16b REG) feeding a 2-stage pipelined multiplier and an accumulator for the input/output psums; a zero check (== 0) on the image data sets a flag in the zero buffer that gates the scratch-pad read enables and resets the multiplier inputs]

Skip MAC and mem reads when image data is zero. Reduces PE power by 45%.

[Chen et al., ISSCC 2016]
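A behavioral sketch of the data-gating idea (not the Eyeriss RTL): when the image operand is zero, both the filter scratch-pad read and the multiply-accumulate are skipped.

```python
def gated_mac(image_vals, filter_vals):
    """Behavioral model of zero gating: when the image operand is zero, the
    filter read and the multiply-accumulate are skipped entirely."""
    psum = 0
    skipped = 0
    for i, img in enumerate(image_vals):
        if img == 0:
            skipped += 1          # no filter scratch-pad read, no MAC -> switching energy saved
            continue
        w = filter_vals[i]        # filter scratch-pad read happens only for non-zero data
        psum += img * w
    return psum, skipped

psum, skipped = gated_mac([9, 0, 0, 1, 0, 5, 0, 6, 0], [1, 2, 3, 4, 5, 6, 7, 8, 9])
print(psum, skipped)  # skipped MACs correspond directly to gated operations
```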
7
Cnvlutin
• Processes convolution layers
• Built on top of DaDianNao (4.49% area overhead)
• Speedup of 1.37× (1.52× with activation pruning)
[Albericio et al., ISCA 2016]
8
Pruning Activations
Remove small activation values.
Minerva [Reagen et al., ISCA 2016]: reduces power by 2× (MNIST)
Cnvlutin [Albericio et al., ISCA 2016]: speeds up by 11% (ImageNet)
9
Exploit Correlation in Input Data
• Exploit Temporal Correlation of Inputs
  – Reduce the amount of computation if there is temporal correlation between frames
  – Requires additional storage and a way to measure redundancy (e.g., motion vectors for video)
  – Application specific (e.g., video) – requires that the same operation is done for each frame (not always the case)
[Zhang et al., FAST, CVPRW 2017], [EVA2, ISCA 2018],
[Euphrates, ISCA 2018], [Riera et al., ISCA 2018]
10
Exploit Correlation in Input Data
• Exploit Spatial Correlation of Inputs
  – Delta code neighboring values (activations), resulting in sparse inputs to each layer (see the sketch below)
  – Reduces storage cost and data movement, improving energy efficiency and throughput
[Diffy, MICRO 2018]
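A small sketch of the idea (using NumPy, with a made-up smooth activation row): delta coding neighboring activations turns spatially correlated values into mostly zeros and small deltas.

```python
import numpy as np

def delta_code(row):
    # Store the first value and the differences between horizontal neighbors.
    deltas = np.empty_like(row)
    deltas[0] = row[0]
    deltas[1:] = row[1:] - row[:-1]
    return deltas

def delta_decode(deltas):
    return np.cumsum(deltas)

row = np.array([50, 50, 51, 51, 51, 52, 52, 52, 52, 53])   # hypothetical smooth activations
deltas = delta_code(row)
print(deltas)                       # mostly 0s and 1s -> sparse / highly compressible
assert np.array_equal(delta_decode(deltas), row)
```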
11
Pruning – Make Weights Sparse
• Optimal Brain Damage [Lecun et al., NeurIPS 1989]
  1. Choose a reasonable network architecture
  2. Train network until a reasonable solution is obtained
  3. Compute the second derivative for each weight
  4. Compute saliencies (i.e., impact on training error) for each weight
  5. Sort weights by saliency and delete low-saliency weights
  6. Iterate from step 2 (retraining)
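As a sketch of steps 3–5: OBD's saliency for weight k is commonly written as s_k = ½·h_kk·w_k², where h_kk is the diagonal Hessian term. Assuming the h_kk values have already been computed (step 3), the prune step reduces to a sort:

```python
import numpy as np

def obd_prune(weights, hessian_diag, prune_fraction=0.5):
    """Sketch of OBD steps 4-5: saliency s_k = 0.5 * h_kk * w_k^2,
    then delete (zero out) the lowest-saliency weights."""
    saliency = 0.5 * hessian_diag * weights ** 2
    n_prune = int(prune_fraction * weights.size)
    prune_idx = np.argsort(saliency)[:n_prune]   # least impact on training error
    pruned = weights.copy()
    pruned[prune_idx] = 0.0
    return pruned

w = np.random.randn(1000)
h = np.abs(np.random.randn(1000))    # stand-in for the (assumed precomputed) diagonal Hessian
w_pruned = obd_prune(w, h, prune_fraction=0.5)
# Step 6: retrain the remaining weights and repeat.
```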
12
Pruning – Make Weights Sparse
Prune based on magnitude of weights: Train Connectivity → Prune Connections → Train Weights
[Figure: before pruning vs. after pruning — pruning synapses and pruning neurons]
Example: AlexNet
  Weight reduction: CONV layers 2.7×, FC layers 9.9× (most reduction in the fully connected layers)
  Overall: 9× weight reduction, 3× MAC reduction
[Han et al., NeurIPS 2015]
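A hedged sketch of the train–prune–retrain loop (thresholding by weight magnitude; the `train` callable and the gradual sparsity schedule are placeholders, not the authors' exact recipe):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Prune connections whose magnitude falls below the sparsity-th percentile."""
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Sketch of the Train Connectivity -> Prune Connections -> Train Weights loop.
# `train(weights, mask)` is a placeholder for retraining only the surviving weights.
def iterative_prune(weights, train, target_sparsity=0.9, steps=3):
    mask = np.ones_like(weights, dtype=bool)
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps      # prune gradually
        weights, mask = magnitude_prune(weights, sparsity)
        weights = train(weights, mask)                 # retrain with pruned connections held at 0
    return weights, mask
```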
13
Speed up of Weight Pruning on CPU/GPU
Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV
NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV
NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV
Batch size = 1
On Fully Connected Layers Only
Average speedup of 3.2× on GPU, 3× on CPU, 5× on mGPU
[Han et al., NeurIPS 2015]
14
Design of Efficient DNN Algorithms
• Popular efficient DNN algorithm approaches
[Figure: Network Pruning (before vs. after pruning — pruning synapses and pruning neurons) and Compact Network Architectures (replacing an R×S×C filter with smaller 1×1 and R×S filters); examples: SqueezeNet, MobileNet]
... also reduced precision
• Focus on reducing number of MACs and weights
• Does it translate to energy savings and reduced latency?
15
Energy-Evaluation Methodology
[Flow: the CNN shape configuration (# of channels, # of filters, etc.) and the CNN weights and input data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]) feed a memory-access optimization and a # of MACs calculation; the resulting # of accesses at each memory level (1…n) and # of MACs are weighted by the hardware energy cost of each MAC and memory access to yield Ecomp and Edata, which combine into the per-layer CNN energy consumption (L1, L2, L3, …)]
Evaluation tool available at http://eyeriss.mit.edu/energy.html
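The methodology reduces to a weighted sum, E = Ecomp + Edata = (#MACs)·e_MAC + Σ_i (#accesses at level i)·e_i. A minimal sketch (the per-access energy costs and access counts below are placeholders, not the tool's calibrated numbers):

```python
def dnn_energy(num_macs, accesses_per_level, e_mac, e_access_per_level):
    """E = Ecomp + Edata:
       Ecomp = (# of MACs) x (energy per MAC)
       Edata = sum over memory levels of (# accesses at level i) x (energy per access at level i)."""
    e_comp = num_macs * e_mac
    e_data = sum(n * e for n, e in zip(accesses_per_level, e_access_per_level))
    return e_comp + e_data

# Hypothetical numbers for one layer: MACs plus accesses at RF / buffer / DRAM.
energy = dnn_energy(
    num_macs=105_415_200,
    accesses_per_level=[3.0e8, 5.0e7, 4.0e6],
    e_mac=1.0,                              # normalized to one MAC
    e_access_per_level=[1.0, 6.0, 200.0],   # placeholder relative access costs
)
print(f"estimated energy (in MAC-equivalents): {energy:.3e}")
```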
16
Key Observations
• Number of weights alone is not a good metric for energy
• All data types should be considered
[Pie chart: energy consumption breakdown of GoogLeNet — output feature map 43%, input feature map 25%, weights 22%, computation 10%]
[Yang et al., CVPR 2017]
17
Energy-Aware Pruning
Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings.
• Sort layers based on energy and prune the layers that consume the most energy first
• EAP reduces AlexNet energy by 3.7× and outperforms the previous work that uses magnitude-based pruning by 1.7×
[Chart: normalized AlexNet energy (×10^9) — original (Ori.) vs. magnitude-based pruning (DC), a 2.1× reduction, vs. energy-aware pruning (EAP), a 3.7× reduction]
Pruned models available at
http://eyeriss.mit.edu/energy.html
[Yang et al., CVPR 2017]
18
# of Operations vs. Latency
• # of operations (MACs) does not approximate latency well
Source: Google (https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html)
19
NetAdapt: Platform-Aware DNN Adaptation
• Automatically adapt a DNN to a mobile platform to reach a target latency or energy budget
• Use empirical measurements to guide the optimization (avoids modeling of the tool chain or platform architecture)
[Yang et al., ECCV 2018]
[Figure: NetAdapt takes a pretrained network and a budget (e.g., latency 3.8, energy 10.5), generates network proposals (A…Z), obtains empirical measurements for each proposal on the target platform (e.g., latency 15.6…14.3, energy 41…46), and iterates until it produces an adapted network that meets the budget]
Code to be released at http://netadapt.mit.edu
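A simplified pseudocode sketch of the adaptation loop described above (the proposal generation, empirical measurement, and fine-tuning callables are placeholders, and the selection rule is simplified relative to the paper):

```python
def netadapt(pretrained_net, latency_budget, generate_proposals, measure, finetune):
    """Sketch of the NetAdapt loop: propose simplified networks, measure each one
    empirically on the target platform, keep the best, and repeat until the
    latency budget is met. All callables are placeholders."""
    net = pretrained_net
    current = measure(net)                       # empirical measurement, not an analytical model
    while current["latency"] > latency_budget:
        proposals = generate_proposals(net)      # e.g., one per layer, each slightly simplified
        measured = [(p, measure(p)) for p in proposals]
        # among proposals that reduce latency, keep the one with the highest accuracy
        # (assumes at least one proposal reduces latency)
        candidates = [(p, m) for p, m in measured if m["latency"] < current["latency"]]
        net, current = max(candidates, key=lambda pm: pm[1]["accuracy"])
        net = finetune(net)                      # short fine-tune after each adaptation step
    return net
```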
20
Improved Latency vs. Accuracy Tradeoff
• NetAdapt boosts the real inference speed of MobileNet
by up to 1.7x with higher accuracy
[Plot: latency vs. accuracy — NetAdapt points are +0.3% accuracy at 1.7× faster and +0.3% accuracy at 1.6× faster than the reference networks]
Reference:
MobileNet: Howard et al, “Mobilenets: Efficient convolutional neural networks for mobile vision applications”, arXiv 2017
MorphNet: Gordon et al., “Morphnet: Fast & simple resource-constrained structure learning of deep networks”, CVPR 2018
*Tested on the ImageNet dataset and a Google Pixel 1 CPU
[Yang et al., ECCV 2018]
21
Compression of Weights & Activations
• Compress weights and activations between DRAM and the accelerator
• Variable Length / Huffman Coding
• Tested on AlexNet → 2× overall BW reduction
[Moons et al., VLSI 2016; Han et al., ICLR 2016]
Example:
  Value: 16'b0 → Compressed Code: {1'b0}
  Value: 16'bx → Compressed Code: {1'b1, 16'bx}
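A bit-level sketch of the zero-flag scheme in the example above (assuming 16-bit values; the variable-length/Huffman coding of the non-zero values mentioned in the bullets is omitted here):

```python
def compress(values):
    """Sketch of the zero-flag scheme: a zero 16-bit value becomes a single '0' bit;
    any other value becomes a '1' bit followed by its 16 bits."""
    bits = []
    for v in values:
        if v == 0:
            bits.append("0")
        else:
            bits.append("1" + format(v & 0xFFFF, "016b"))
    return "".join(bits)

vals = [0, 0, 12, 0, 53, 0, 0, 0, 22]
stream = compress(vals)
print(len(stream), "bits vs.", 16 * len(vals), "uncompressed")   # 57 vs. 144 bits here
```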
22
Compression Overhead
Index (non-zero position info – e.g., IA and JA for CSR) accounts for approximately half of the storage for fine-grained pruning
[Han et al., ICLR 2016]
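A back-of-the-envelope check of that claim (the matrix size, density, and the 16-bit value/index widths are assumptions for illustration):

```python
# CSR storage for a fine-grained-pruned FC layer
rows, cols, density = 4096, 4096, 0.1
nnz = int(rows * cols * density)

value_bits = nnz * 16                 # non-zero weight values (assumed 16-bit)
ja_bits = nnz * 16                    # JA: one column index per non-zero (assumed 16-bit)
ia_bits = (rows + 1) * 32             # IA: row pointers
index_bits = ja_bits + ia_bits

print(index_bits / (value_bits + index_bits))   # ~0.5: index is about half the storage
```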
23
Coarse-Grained Pruning
[Figure: M filters of size R×S×C applied to an H×W×C input fmap produce an E×F output fmap with M output channels]
May prune by eliminating entire filter planes, or extremely sparse input activation planes, or just a tile of either.
24
Structured/Coarse-Grained Pruning
• Scalpel
  – Prune to match the underlying data-parallel hardware organization for speedup
[Yu et al., ISCA 2017]
[Figure: dense weights vs. sparse weights — example: 2-way SIMD]
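A hedged sketch of SIMD-aware weight pruning in the spirit of Scalpel: weights are kept or removed in groups matching the SIMD width, so the surviving non-zeros stay packed for the SIMD lanes (the group-importance score used here is a simplification):

```python
import numpy as np

def simd_aware_prune(weights, simd_width=2, keep_fraction=0.5):
    """Sketch of SIMD-aware pruning: weights are kept or dropped in groups of
    `simd_width`, so surviving non-zeros stay aligned to the SIMD lanes."""
    flat = weights.reshape(-1, simd_width)                  # assumes size divisible by simd_width
    group_score = np.abs(flat).max(axis=1)                  # a group survives if any weight is large
    n_keep = int(keep_fraction * flat.shape[0])
    keep = np.argsort(group_score)[-n_keep:]
    mask = np.zeros_like(flat)
    mask[keep] = 1.0                                        # keep or drop whole SIMD groups
    return (flat * mask).reshape(weights.shape)

w = np.random.randn(8, 8)
print(simd_aware_prune(w, simd_width=2, keep_fraction=0.5))
```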
25
Exploit Redundant Weights
• Preprocess to reorder weights (OK since the weights are known)
• Perform additions before multiplications to reduce the number of multiplies and reads of weights
• Example: Input = [1 2 3] and filter = [A B A]
  Typical processing: Output = A*1 + B*2 + A*3 → 3 multiplies and 3 weight reads
  If reordered as [A A B]: Output = A*(1+3) + B*2 → 2 multiplies and 2 weight reads
  Note: the bitwidth of the multiplication may need to increase
[UCNN, ISCA 2018]
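A sketch that generalizes the example above: group inputs by their (repeated) weight value, add within each group first, then multiply once per distinct weight. The concrete values of A and B are made up for the check at the end.

```python
from collections import defaultdict

def factored_dot(inputs, weights):
    """Group inputs that share the same weight value, add them first, then
    multiply once per distinct weight (as in the slide's [A B A] example)."""
    groups = defaultdict(float)          # weight value -> running sum of its inputs
    for x, w in zip(inputs, weights):
        groups[w] += x                   # additions happen before any multiply
    return sum(w * s for w, s in groups.items())   # one multiply (and weight read) per distinct weight

# Slide example: Input = [1, 2, 3], filter = [A, B, A] with A = 4, B = 7 (values assumed)
A, B = 4, 7
assert factored_dot([1, 2, 3], [A, B, A]) == A * (1 + 3) + B * 2 == A * 1 + B * 2 + A * 3
```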
26
Exploit ReLU
• Reduce the number of operations when the resulting activation will be negative, since ReLU will set it to zero
• Need to either preprocess (e.g., sort the weights) or minimize the prediction overhead and error
[PredictiveNet, ISCAS 2017], [SnaPEA, ISCA 2018], [Song et al., ISCA 2018]
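A sketch in the spirit of these works (using the weight-sorting preprocessing mentioned above): since post-ReLU inputs are non-negative, once only non-positive weights remain and the partial sum is already ≤ 0, the output must be clipped to zero, so the remaining MACs can be skipped.

```python
def relu_dot_early_exit(inputs, weights):
    """Sketch: inputs are assumed non-negative (post-ReLU). Process positive
    weights first; once only non-positive weights remain and the partial sum is
    already <= 0, the remaining MACs cannot make it positive, so ReLU(out) = 0."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    psum = 0.0
    for n, i in enumerate(order):
        if weights[i] <= 0 and psum <= 0:
            return 0.0, n                 # early exit: skipped len(weights) - n MACs
        psum += inputs[i] * weights[i]
    return max(psum, 0.0), len(order)

out, macs = relu_dot_early_exit([1.0, 2.0, 0.5, 3.0], [-0.2, 0.1, -0.5, -1.0])
print(out, "MACs performed:", macs)       # 0.0 with only 2 of 4 MACs performed
```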
27
Compact Network Architectures
• Break large convolutional layers into a series of smaller convolutional layers
  – Fewer weights, but the same effective receptive field (see the example below)
• Before Training: Network Architecture Design
(already discussed this morning; e.g., MobileNet)
• After Training: Decompose Trained Filters
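A worked example of the receptive-field argument referenced above (standard arithmetic, not tied to any particular paper): two stacked 3×3 convolutions see the same 5×5 window as a single 5×5 convolution, with fewer weights.

```python
def conv_weights(k, channels):
    # weights of a k x k convolution with `channels` input and output channels (bias ignored)
    return k * k * channels * channels

C = 64
single_5x5 = conv_weights(5, C)          # one 5x5 layer: 25 * C^2 weights
stacked_3x3 = 2 * conv_weights(3, C)     # two 3x3 layers: 18 * C^2 weights, same 5x5 receptive field
print(single_5x5, stacked_3x3, single_5x5 / stacked_3x3)   # ~1.4x fewer weights
```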
28
Decompose Trained Filters
After training, perform a low-rank approximation by applying tensor decomposition to the weight kernel; then fine-tune the weights to recover accuracy
[Lebedev et al., ICLR 2015]
R = canonical rank
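The CP decomposition itself takes more machinery; as a simpler illustration of the same low-rank idea (a truncated SVD of the reshaped filter bank rather than the paper's CP decomposition), with random weights standing in for a trained layer:

```python
import numpy as np

def low_rank_filters(W, rank):
    """Illustration of low-rank filter approximation via truncated SVD.
    W has shape (M, C, R, S): M filters of size C x R x S."""
    M, C, R, S = W.shape
    mat = W.reshape(M, C * R * S)                 # one filter per row
    U, s, Vt = np.linalg.svd(mat, full_matrices=False)
    # Two smaller factors (M x rank) and (rank x C*R*S) replace the original M x C*R*S matrix
    A = U[:, :rank] * s[:rank]
    B = Vt[:rank, :]
    approx = (A @ B).reshape(M, C, R, S)
    return A, B, approx

W = np.random.randn(64, 32, 3, 3)
A, B, W_approx = low_rank_filters(W, rank=16)
print(W.size, "->", A.size + B.size, "weights; rel. error:",
      np.linalg.norm(W - W_approx) / np.linalg.norm(W))
# Fine-tuning the factors would follow to recover accuracy.
```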
29
Decompose Trained Filters
[Denton et al., NeurIPS 2014]
• Speed up by 1.6 – 2.7x on CPU/GPU for CONV1,
CONV2 layers
• Reduce size by 5 - 13x for FC layer
• < 1% drop in accuracy
[Figure: visualization of the original vs. approximated filters]
30
Decompose Trained Filters on Phone
[Kim et al., ICLR 2016]
Tucker Decomposition
31
Knowledge Distillation
[Bucilu et al., KDD 2006],[Hinton et al., arXiv 2015]
[Figure: complex teacher DNNs A and B each produce class probabilities through their softmax layers; a simple student DNN is trained so that its softmax scores try to match the teachers' class probabilities]
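A minimal sketch of the "try to match" objective, following the temperature-softened softmax of Hinton et al. (the temperature value and the example logits are arbitrary):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the teacher's softened class probabilities and the
    student's softened softmax ('try to match'). T is the distillation temperature."""
    p_teacher = softmax(teacher_logits, T)       # soft targets from the (possibly ensembled) teachers
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean()

teacher = np.array([[10.0, 2.0, 1.0]])           # hypothetical teacher scores
student = np.array([[6.0, 3.0, 2.0]])            # student scores to be trained toward the teacher
print(distillation_loss(student, teacher))
```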