Massachusetts Institute of Technology
Song Han
“Once-for-All” DNNs: Simplifying Design of
Efficient Models for Diverse Hardware
The Rise of AIoT
IoT + AI = AIoT
Less Computational Resources: TinyML
Less Engineer Resources: AutoML
Simplify: many engineers, a large model, and a lot of computation → fewer engineers, a small model, and less computation.
Efficient AI Applications:
- Efficient Video recognition: TSM highlighted by NVIDIA and IBM, adopted by Baidu PaddlePaddle
and HKUST MMLab
- Efficient 3D point cloud recognition, autonomous driving: 1st place on SemanticKITTI leaderboard,
adopted by MIT Driverless
- Machine translation, NLP: Reduce the design cost by 4 orders of magnitude compared with Google.
Automated Tools:
- Pruning, Quantization, Compression: co-founded DeePhi Tech, acquired by Xilinx, industry standard.
- Two generations of automated NN architecture design (ProxylessNAS, OFA): adopted by Facebook
PyTorch and Amazon AutoGluon
- AI designed by AI outperforms human performance:
• 1st place, 3rd/4th Low-Power Computer Vision Challenge @ICCV’19, NeurIPS’19 [paper]
• 1st place, MicroNet Challenge, NLP track (WikiText-103), @NeurIPS’19
• 1st place, Visual Wake Words challenge on MCU @CVPR’19
Efficient Hardware & AI for EDA:
- EIE: the first accelerator that supports pruned and sparse weights. Influenced NVIDIA’s Ampere GPU,
NVIDIA’s DLA, ARM’s Project Trillium, Samsung’s NPU, and Intel’s NN Distiller.
Research Topics
TinyML and Efficient Deep Learning: Low Latency, Low Energy, Small Model Size; Full-stack; Automated
Research Topics
Full Stack Research, spanning Algorithm and Hardware, Edge and Cloud, Training and Inference:
[Edge | Inference | Algorithm] Hardware-Aware Quantization CVPR 19 Oral
[Edge | Inference | Algorithm] Point-Voxel CNN for 3D DL NeurIPS 19 Spotlight
[Cloud | Training | Algorithm] Deep Leakage from Gradients NeurIPS 19
[Edge/Cloud | Inference | Algorithm] Once-for-All Network ICLR 20
[Edge | Inference | Algorithm] GAN Compression CVPR 20
[Edge/Cloud | Inference | Hardware] SpArch for Sparse Matrix Multiplication HPCA 20
[Edge | Inference | Algorithm] Temporal Shift Module for Video ICCV 19
[Edge | Inference | Algorithm] HAT: Hardware-Aware Transformer ACL 20
[Edge | Inference | Algorithm] Lite Transformer ICLR 20
We Need Green AI: Solve the Environmental Problem of NAS
Problem: TinyML (inference) comes at the cost of BigML (training/search).
[Chart: NAS design cost, the Evolved Transformer (ICML’19, ACL’19) vs. ours] Our “Hardware-Aware Transformer” (ACL’20) reduces the design cost by 4 orders of magnitude.
Once-for-all, ICLR’20
Challenge: Efficient Inference on Diverse Hardware Platforms
Diverse Hardware Platforms …
Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS), served by a single Once-for-All Network.
Design Cost (GPU hours) → CO2 emission:
• 40K GPU hours → 11.4k lbs CO2 emission
• 160K GPU hours → 45.4k lbs CO2 emission
• 1600K GPU hours → 454.4k lbs CO2 emission
1 GPU hour translates to 0.284 lbs CO2 emission according to
Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL. 2019.
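As a quick arithmetic check of the figures above, a minimal Python sketch (the only input is the 0.284 lbs CO2 per GPU hour factor cited above):

# Convert NAS design cost (GPU hours) into CO2 emission (lbs),
# using 0.284 lbs CO2 per GPU hour (Strubell et al., ACL 2019).
LBS_CO2_PER_GPU_HOUR = 0.284

for gpu_hours in (40_000, 160_000, 1_600_000):
    co2_klbs = gpu_hours * LBS_CO2_PER_GPU_HOUR / 1000
    print(f"{gpu_hours:>9,} GPU hours -> {co2_klbs:.1f}k lbs CO2")
# Prints 11.4k, 45.4k, and 454.4k lbs, matching the slide.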
Our Solution: Once-for-All Network
Get many child nets for free
Once-for-all, ICLR’20
OFA: Decouple Training and Search

Conventional NAS with a meta controller (expensive, and repeated for every device):
    For devices:
        For search episodes:  // meta controller
            For training iterations:  // expensive
                forward-backward();
            If good_model: break;
        For post-search training iterations:  // expensive
            forward-backward();

Once-for-All: decouple training from search:
    For OFA training iterations:  // training: expensive, but paid only once
        forward-backward();
    For devices:
        For search episodes:  // search: light-weight
            sample from OFA;
            If good_model: break;
        directly deploy without training;  // light-weight
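A minimal Python sketch of this decoupling; every helper name, value range, and latency budget below is a hypothetical stand-in rather than the released OFA training code:

import random

def train_ofa_supernet(iterations):
    """Expensive: one-time training of the once-for-all supernet
    (forward-backward over randomly sampled sub-networks per step)."""
    for _ in range(iterations):
        pass  # stand-in for forward_backward()

def sample_subnet():
    """Sample one sub-network configuration from the OFA design space."""
    return {"kernel": random.choice([3, 5, 7]),
            "depth": random.choice([2, 3, 4]),
            "width": random.choice([3, 4, 6]),
            "resolution": random.choice(range(128, 225, 16))}

def predicted_accuracy(cfg):            # stand-in for the accuracy predictor
    return sum(cfg.values()) + random.random()

def predicted_latency(cfg, device):     # stand-in for the latency predictor
    return 0.1 * cfg["resolution"] + cfg["depth"]

# Train once (expensive), then specialize per device (light-weight, no retraining).
train_ofa_supernet(iterations=1000)
for device, budget_ms in [("pixel1", 26.0), ("xeon_cpu", 22.0), ("zu3eg_fpga", 18.0)]:
    feasible = [c for c in (sample_subnet() for _ in range(1000))
                if predicted_latency(c, device) <= budget_ms]
    best = max(feasible, key=predicted_accuracy, default=None)
    print(device, "->", best)           # deploy directly, without training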
Once-for-all, ICLR’20
Once-for-All Network: Decouple Model Training and Architecture Design
once-for-all network
Once-for-all, ICLR’20
Challenge: how to prevent different subnetworks
from interfering with each other?
14
Once-for-all, ICLR’20
Solution: Progressive Shrinking
• Training a once-for-all network is much more challenging than training a normal neural network, given so many sub-networks to support.
• Progressive Shrinking can support more than 10^19 different sub-networks in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, width.
Once-for-all, ICLR’20
Progressive Shrinking: train the full model → shrink the model (4 dimensions) → jointly fine-tune both large and small sub-networks → once-for-all network.
• Small sub-networks are nested in large sub-networks.
• Cast the training process of the once-for-all network as a progressive shrinking and joint fine-tuning process.
Once-for-all, ICLR’20
Connection to Network Pruning
Network Pruning: train the full model → shrink the model (only width) → fine-tune the small net → single pruned network.
Progressive Shrinking: train the full model → shrink the model (4 dimensions) → fine-tune both large and small sub-nets → once-for-all network.
• Progressive shrinking can be viewed as generalized network pruning with much higher flexibility across 4 dimensions.
Once-for-all, ICLR’20
Progressive Shrinking
[Slides 18-41 animate the progressive shrinking schedule across the four elastic dimensions (Elastic Resolution, Elastic Kernel Size, Elastic Depth, Elastic Width), each starting from its Full setting and gradually admitting Partial (smaller) settings.]
Resolution is made elastic from the start of training; kernel size, depth, and width are then made elastic one dimension at a time. While one dimension is being shrunk, the dimensions not yet touched stay at their full settings, and settings unlocked in earlier stages keep being sampled and jointly fine-tuned.
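A minimal sketch of a progressive-shrinking schedule in this spirit. The stage order follows the description above (resolution elastic throughout; kernel size, then depth, then width made elastic in turn), while the concrete value ranges and step counts are illustrative assumptions:

import random

# Progressive shrinking: start from the full network, then allow smaller
# ("partial") settings one elastic dimension at a time. Resolution is
# elastic from the start; kernel size, depth, and width follow in order.
STAGES = [
    # stage name            kernel sizes   depths      widths
    ("full network",        [7],           [4],        [6]),
    ("elastic kernel size", [3, 5, 7],     [4],        [6]),
    ("elastic depth",       [3, 5, 7],     [2, 3, 4],  [6]),
    ("elastic width",       [3, 5, 7],     [2, 3, 4],  [3, 4, 6]),
]
RESOLUTIONS = list(range(128, 225, 4))   # elastic resolution throughout

def sample_config(kernels, depths, widths):
    """Sample one sub-network from the currently allowed choices."""
    return {"resolution": random.choice(RESOLUTIONS),
            "kernel": random.choice(kernels),
            "depth": random.choice(depths),
            "width": random.choice(widths)}

for stage, kernels, depths, widths in STAGES:
    for step in range(3):                # a few illustrative steps per stage
        cfg = sample_config(kernels, depths, widths)
        # Real training would activate this sub-network inside the supernet,
        # run forward-backward on a batch, and fine-tune it jointly with the
        # larger sub-networks (the paper also distills from the full network).
        print(stage, cfg)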
Once-for-all, ICLR’20 42
Performance of Sub-networks on ImageNet
[Chart: ImageNet top-1 accuracy (%) of sub-networks trained without progressive shrinking (w/o PS) vs. with progressive shrinking (w/ PS)]
Sub-networks under various architecture configurations (D: depth ∈ {2, 4}; W: width ∈ {3, 6}; K: kernel size ∈ {3, 7}); PS improves top-1 accuracy by 2.5% to 3.7% on every configuration.
• Progressive shrinking consistently improves accuracy of sub-networks on ImageNet.
Once-for-all, ICLR’20
Train Once, Get Many
43
Once-for-all, ICLR’20
How about search? Zero training cost!
    for OFA training iterations:  // training (decoupled from search)
        forward-backward();
    for devices:
        for search episodes:  // search
            sample from OFA;  // with evolution or even random search
            if good_model: break;
        directly deploy without training;
Once-for-all, ICLR’20
How to evaluate if good_model? — by Model Twin
Sample sub-networks from the OFA Network to build an accuracy dataset [Architecture, Accuracy] and a latency dataset [Architecture, Latency], then train an Accuracy Prediction Model and a Latency Prediction Model on them (accuracy/latency predictor RMSE ~0.2%). Predictor-based Architecture Search then returns the Specialized Sub-Network.
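A minimal sketch of the model-twin idea: fit cheap regressors on [architecture, accuracy] and [architecture, latency] pairs, then search against the predictors instead of training or measuring each candidate. The feature encoding, least-squares models, and synthetic measurements below are placeholders for the learned predictors used in the paper:

import numpy as np

rng = np.random.default_rng(0)

def random_cfg():
    return {"kernel": int(rng.choice([3, 5, 7])), "depth": int(rng.choice([2, 3, 4])),
            "width": int(rng.choice([3, 4, 6])), "resolution": int(rng.choice([160, 192, 224]))}

def encode(cfg):
    """Encode an architecture config as a small feature vector (plus bias term)."""
    return np.array([cfg["kernel"], cfg["depth"], cfg["width"], cfg["resolution"] / 32.0, 1.0])

# 1) Build small [architecture, accuracy] and [architecture, latency] datasets by
#    evaluating sampled sub-networks (synthetic stand-in measurements here).
cfgs = [random_cfg() for _ in range(200)]
X = np.stack([encode(c) for c in cfgs])
acc = X @ np.array([0.3, 0.8, 0.5, 1.0, 60.0]) + rng.normal(0, 0.2, len(cfgs))
lat = X @ np.array([0.5, 2.0, 1.5, 2.5, 0.0]) + rng.normal(0, 0.5, len(cfgs))

# 2) Fit the two "twin" predictors (the paper trains small neural networks;
#    plain least squares keeps this sketch dependency-free).
w_acc, *_ = np.linalg.lstsq(X, acc, rcond=None)
w_lat, *_ = np.linalg.lstsq(X, lat, rcond=None)

def predict(w, cfg):
    return float(encode(cfg) @ w)

# 3) Search using only the predictors: keep the most accurate candidate that
#    fits the latency budget. No training happens inside the loop.
budget_ms = 30.0
candidates = [random_cfg() for _ in range(5000)]
feasible = [c for c in candidates if predict(w_lat, c) <= budget_ms]
best = max(feasible, key=lambda c: predict(w_acc, c), default=None)
print("specialized sub-network:", best)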
Once-for-all, ICLR’20
Our latency model is super accurate
46
Once-for-All, ICLR’20
Accuracy & Latency Improvement
[Chart: ImageNet top-1 accuracy (%) vs. Google Pixel1 latency (ms), OFA vs. EfficientNet] Against EfficientNet (76.3% to 79.8% top-1), OFA reaches 80.1% top-1 while being 2.6x faster, i.e., 3.8% higher accuracy at comparable latency.
[Chart: ImageNet top-1 accuracy (%) vs. Google Pixel1 latency (ms), OFA vs. MobileNetV3] Against MobileNetV3 (67.4% to 75.2% top-1), OFA delivers 71.4% to 76.4% top-1, i.e., 4% higher accuracy or 1.5x faster.
• Training from scratch cannot achieve the same level of accuracy
Once-for-All, ICLR’20
More accurate than training from scratch
[Charts: the same OFA vs. EfficientNet and OFA vs. MobileNetV3 comparisons as above, with an added “OFA - Train from scratch” series] The same sub-network architectures trained from scratch fall below the accuracy of sub-networks taken directly from the once-for-all network.
• Training from scratch cannot achieve the same level of accuracy
OFA: 80% Top-1 Accuracy on ImageNet
49
[Chart: ImageNet top-1 accuracy (%) vs. MACs (billions), with marker size indicating model size (2M to 64M parameters), comparing Once-for-All (ours) with handcrafted models (MobileNetV1/V2, ShuffleNet, IGCV3-D, InceptionV2/V3, DenseNet-121/169/264, ResNet-50/101, ResNeXt-50/101, DPN-92, Xception) and AutoML-designed models (EfficientNet, ProxylessNAS, MobileNetV3, NASNet-A, AmoebaNet, PNASNet, DARTS). Higher accuracy and lower MACs are better.]
OFA: 595M MACs, 80.0% top-1, 14x less computation.
• Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under
the mobile vision setting (< 600M MACs).
Once-for-All, ICLR’20
Once-for-all, ICLR’20
OFA Enables Fast Specialization on Diverse Hardware Platforms
50
[Charts: ImageNet top-1 accuracy (%) vs. measured latency on six platforms: Samsung S7 Edge, Google Pixel2, and LG G8 mobile CPUs; NVIDIA 1080Ti GPU (batch size 64); Intel Xeon CPU (batch size 1); Xilinx ZU3EG FPGA (batch size 1, quantized). OFA is compared with MobileNetV2 and MobileNetV3.]
On every platform, the specialized OFA sub-networks reach higher accuracy at comparable or lower latency than the MobileNet baselines.
OFA’s Application: Low Power Computer Vision
Target: Qualcomm Snapdragon 855 with Hexagon 690 DSP, latency < 7 ms. OFA Network → Specialized Sub-network → Deploy.
Our result: latency 5.15 ms, top-1 78.8%.
• First place in the 3rd Low Power Computer Vision Challenge, DSP track, at ICCV’19
• First place in the 4th Low Power Computer Vision Challenge, both classification and detection tracks
OFA for FPGA: measured results on the Xilinx ZU3EG FPGA
[Chart: arithmetic intensity (OPS/Byte) and achieved performance (GOPS/s) for MobileNetV2, MnasNet, and OFA (ours)] OFA comes out 40% and 57% higher on these metrics than the baselines: a specialized NN architecture on a specialized hardware architecture.
Once-for-All, ICLR’20
Once-for-All, ICLR’20
Specialized Architecture for Different Hardware Platforms
53
Tutorial on ProxylessNAS & OFA
● IPython Notebook tutorial.
● Architecture search with 1 GPU in 2 minutes.
● Hands-on lab at 3:45pm PT today, office hour at 6:00pm.
Website | Zoom Link
Once-for-All Network (OFA) has broad applications
• Efficient Video Recognition
• Efficient 3D Vision
• Efficient GAN Compression
[Chart: Kinetics top-1 accuracy (%) vs. computation (GFLOPs) for OFA + TSM (large and small), MobileNetV2 + TSM, ResNet50 + TSM, and ResNet50 + I3D] At the same accuracy, OFA + TSM uses 7x less computation; at the same computation, it gains +3.0% accuracy.
(follow-up of TSM, ICCV’19)
OFA’s Application: Efficient Video Recognition
7x less computation, same performance as TSM+ResNet50
same computation, 3% higher accuracy than TSM+MobileNet-v2
Latency Comparison
Batch size=1. Measured on NVIDIA Tesla P100.
Each row represents a video.
I3D:
Latency: 164.3 ms/Video Something-V1 Acc.: 41.6%
TSM:
Latency: 17.4 ms/Video Something-V1 Acc.: 43.4%
Speed-up: 9x
Throughput Comparison
Batch size=16. Measured on NVIDIA Tesla P100.
Each square represents a video.
I3D:
Throughput: 6.1 video/s
Something-V1 Acc.: 41.6%
TSM:
Throughput: 77.4 video/s
Something-V1 Acc.: 43.4%
12.7x larger throughput
59
Improving the Robustness of Online Video Detection
Gesture recognition
60
Scaling Up: Large-Scale Distributed Training with the SUMMIT Supercomputer
SUMMIT Supercomputer (per node):
• CPU: 2 x 16-core IBM POWER9 (connected via dual NVLINK bricks, 25 GB/s each side)
• GPU: 6 x NVIDIA Tesla V100
• RAM: 512 GB DDR4 memory
• Data storage: HDD
• Connection: dual-rail EDR InfiniBand network at 23 GB/s
Acknowledgment: IBM and Oak Ridge National Lab
* Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1811.08383
● We are able to speed up the training by 200x, from 2 days to 14 minutes.
● Model setup: 8-frame ResNet-50 for video recognition
● Dataset: Kinetics (240k training videos) x 100 epochs
Training time, accuracy, peak GPU performance, and speed-up by cluster size:
• 1 SUMMIT node (6 GPUs): 49h 50min, 74.1%, 46.5 TFLOP/s (baseline)
• 128 SUMMIT nodes (768 GPUs): 28min, 74.1%, 5,989 TFLOP/s, speed-up theoretical 128x / actual 106x
• 256 SUMMIT nodes (1536 GPUs): 14min, 74.0%, 11,978 TFLOP/s, speed-up theoretical 256x / actual 211x
Scaling Up: Large-Scale Distributed Training with the SUMMIT Supercomputer
GAN Compression, CVPR’20
OFA’s Application: GAN Compression
8-21x FLOPs reduction on CycleGAN, Pix2pix, GauGAN
1.7x-18.5x speedup on CPU/GPU & Mobile CPU/GPU
OFA’s Application: Efficient 3D Recognition
Self-driving today needs a whole trunk of GPUs; AR/VR needs a whole backpack of computers.
Accuracy vs. Latency Tradeoff: 4x FLOPs reduction and 2x speedup over MinkowskiNet,
3.6% better accuracy under the same computation budget.
SPVNAS, ECCV’20
DarkNet53Seg (2D baseline): Mean IoU 49.9, throughput 9.7 FPS, 50.4M params, 376.3G FLOPs.
SPVNAS (ours): Mean IoU 58.8 (= KPConv), throughput 11.8 FPS, 1.1M params, 10.6G FLOPs.
SPVNAS makes fewer errors (in red) than the 2D baseline model, with 45x model size reduction and 35x computation reduction.
SPVNAS, ECCV’20
Significantly Faster than MinkowskiNets
MinkowskiNet: Mean IoU 63.1, throughput 3.4 FPS (21.7M params, 114.0G FLOPs).
SPVNAS (ours): Mean IoU 63.6, throughput 6.5 FPS (7.6M params, 30.0G FLOPs).
SPVNAS outperforms the state-of-the-art MinkowskiNet, with 2x measured speedup and 3x model size reduction.
SPVNAS, ECCV’20
Qualitative Results on SemanticKITTI
Error By
MinkowskiNets
Less Error By
SPVNAS
Ground Truth
SPVNAS, ECCV’20
Qualitative Results on KITTI
Detection By
SECOND
More Accurate Detection By
SPVNAS
Ground Truth
SPVNAS, ECCV’20
Hardware-aware AutoML: a push-button solution
Make AI efficient, with tiny computational and human resources:
ProxylessNAS, ICLR’19; HAQ, CVPR’19 (oral); AMC, ECCV’18; Once-for-All, ICLR’20; Neural-Hardware Architecture Search, NeurIPS workshop’19; SPVNAS, ECCV’20
1st place, Low Power Computer Vision Challenge’19
1st place, Low Power Computer Vision Challenge’20
1st place, Visual Wake Words Challenge @CVPR’19
AutoML: Design Automation for AI [ECCV’18, ICLR’19, CVPR’19, ICLR’20, CVPR’20]
- We developed two generations of AutoML techniques for efficient NN design (ProxylessNAS, OFA)
- Such AI-designed AI consistently outperforms human performance:
- First place in Low Power Computer Vision Challenges (2019, 2020).
- First place in Visual Wake Words Challenge 2019.
AI for Design Automation [DAC’20]
- AI is Revolutionizing EDA: fast, hw, data-driven
- “GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and
Reinforcement Learning”, DAC’20
- A circuit is a graph; a GCN serves as the feature extractor.
- Transferable between technology nodes & topologies
Summary: Once-for-All Network
• Released 50+ different pre-trained OFA models on diverse hardware platforms (CPU/GPU/FPGA/DSP).
net, image_size = ofa_specialized(net_id, pretrained=True)
• Released the training code & pre-trained OFA network that provides diverse sub-networks without training (see the usage sketch after this summary list).
ofa_network = ofa_net(net_id, pretrained=True)
• We introduce once-for-all network for efficient inference on diverse hardware platforms.
• We present an effective progressive shrinking approach for training once-for-all networks.
Project Page: https://ofa.mit.edu
• Once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios,
setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
• First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV’19
• First place in the 4th Low-Power Computer Vision Challenge @NeurIPS’19, both classification & detection.
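A usage sketch of the two released entry points listed above (github.com/mit-han-lab/once-for-all). The import path, the example net_id strings, and the sub-network helper calls are recalled from the repository and should be treated as assumptions to verify against it:

from ofa.model_zoo import ofa_net, ofa_specialized   # import path: assumption

# Option 1: load one of the 50+ pre-trained sub-networks already specialized
# for a target device and latency budget (net_id format is an assumption;
# see the repo's model zoo for the valid ids).
net, image_size = ofa_specialized("[email protected]_finetune@75", pretrained=True)

# Option 2: load the full once-for-all network and extract sub-networks
# without any retraining (helper names below are assumptions).
ofa_network = ofa_net("ofa_mbv3_d234_e346_k357_w1.0", pretrained=True)
ofa_network.sample_active_subnet()                    # pick a random sub-network
subnet = ofa_network.get_active_subnet(preserve_weight=True)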
Progressive Shrinking: train the full model → shrink the model in 4 dimensions → fine-tune both large and small sub-nets → once-for-all network.
Less Engineer Resources: AutoML
Less Computational Resources: TinyML
Simplify: many engineers, a large model, and a lot of computation → fewer engineers, a small model, and less computation.
The Future of AI is “Tiny”
Vast Applications
Smart Retail Personalized Healthcare
Smart Manufacturing Precision Agriculture
Smart Home
Autonomous Driving
Hardware, AI and Neural-nets: TinyML and Efficient AI
Media:
songhan.mit.edu
youtube.com/c/MITHANLab
github.com/mit-han-lab