A Few Papers

This post collects recent progress in deep-learning training: records for training AlexNet and ResNet-50 on ImageNet, the allreduce architectures behind them such as hierarchical allreduce and 2D-Torus, and acceleration techniques including the AdamW optimizer, SmoothOut, super-convergence, and cyclical learning rates. It also lists work related to the PPO policy-gradient algorithm and other papers worth reading, such as Gradient Harmonized Single-stage Detector.

ImageNet Training Records

AlexNet

| Work | Batch Size | Processor | GPU Interconnect | Time | Top-1 Accuracy |
| --- | --- | --- | --- | --- | --- |
| You et al. | 512 | DGX-1 station | NVLink | 6 hours 10 mins | 58.80% |
| You et al. | 32K | CPU x 1024 | - | 11 mins | 58.60% |
| Jia et al. | 64K | Pascal GPU x 1024 | 100 Gbps | 5 mins | 58.80% |
| Jia et al. | 64K | Pascal GPU x 2048 | 100 Gbps | 4 mins | 58.70% |
| Sun et al. (DenseCommu) | 64K | Volta GPU x 512 | 56 Gbps | 2.6 mins | 58.70% |
| Sun et al. (SparseCommu) | 64K | Volta GPU x 512 | 56 Gbps | 1.5 mins | 58.20% |

ResNet-50

| Work | Batch Size | Processor | GPU Interconnect | Time | Top-1 Accuracy |
| --- | --- | --- | --- | --- | --- |
| Goyal et al. | 8K | Pascal GPU x 256 | 56 Gbps | 1 hour | 76.30% |
| Smith et al. | 16K | Full TPU Pod | - | 30 mins | 76.10% |
| Codreanu et al. | 32K | KNL x 1024 | - | 42 mins | 75.30% |
| You et al. | 32K | KNL x 2048 | - | 20 mins | 75.40% |
| Akiba et al. | 32K | Pascal GPU x 1024 | 56 Gbps | 15 mins | 74.90% |
| Jia et al. | 64K | Pascal GPU x 1024 | 100 Gbps | 8.7 mins | 76.20% |
| Jia et al. | 64K | Pascal GPU x 2048 | 100 Gbps | 6.6 mins | 75.80% |
| Mikami et al. | 68K | Volta GPU x 2176 | 200 Gbps | 3.7 mins | 75.00% |
| Ying et al. | 32K | TPU v3 x 1024 | - | 2.2 mins | 76.30% |
| Sun et al. | 64K | Volta GPU x 512 | 56 Gbps | 7.3 mins | 75.30% |
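
Nearly all of the large-batch entries above build on the linear learning-rate scaling rule with gradual warmup from Goyal et al. A minimal sketch of that schedule, assuming SGD-style training (the helper name and argument layout are mine; base_lr=0.1 at batch 256 and the 5-epoch warmup follow the paper):

```python
def scaled_lr(batch_size, epoch, step, steps_per_epoch,
              base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Linear scaling rule with gradual warmup (Goyal et al., 2017).

    The target LR grows linearly with the batch size; during the first
    warmup_epochs the LR ramps linearly from base_lr up to the target
    to avoid early divergence at large batch sizes.
    """
    target_lr = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        done = epoch * steps_per_epoch + step
        total = warmup_epochs * steps_per_epoch
        return base_lr + (target_lr - base_lr) * done / total
    return target_lr

# At batch 8K (Goyal et al.'s setting) the post-warmup LR is 0.1 * 8192/256 = 3.2.
assert scaled_lr(8192, epoch=5, step=0, steps_per_epoch=100) == 3.2
```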

 

Allreduce architectures:

1. Hierarchical allreduce: Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes (a toy data-flow sketch follows this list)

2. 2D-Torus by Sony: ImageNet/ResNet-50 Training in 224 Seconds (see the 2D-Torus sketch after this list)

   Chinese explainer: 224 seconds! A new best time for training ResNet-50 on ImageNet, with Sony sparing no expense to break the record

3. 2D-Torus by Google: Image Classification at Supercomputer Scale

   Chinese explainer: Google sets a new world record! ImageNet training done in 2 minutes

4. Topology-aware: BlueConnect: Novel Hierarchical All-Reduce on Multi-tiered Network for Deep Learning
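
The hierarchical scheme in item 1 replaces one flat allreduce over all GPUs with an intra-node reduce, an inter-node allreduce among per-node masters, and an intra-node broadcast, shrinking cross-node traffic. A toy NumPy simulation of just the data flow, where a list of per-GPU gradient arrays stands in for real NCCL/MPI collectives:

```python
import numpy as np

def hierarchical_allreduce(grads, gpus_per_node=8):
    """Toy hierarchical allreduce over a list of per-GPU gradient arrays.

    Step 1: reduce inside each node onto that node's 'master' GPU.
    Step 2: allreduce across node masters (here just a sum).
    Step 3: broadcast the global result back inside each node.
    """
    nodes = [grads[i:i + gpus_per_node]
             for i in range(0, len(grads), gpus_per_node)]
    masters = [np.sum(node, axis=0) for node in nodes]   # step 1
    global_sum = np.sum(masters, axis=0)                 # step 2
    return [global_sum.copy() for _ in grads]            # step 3

# 16 GPUs on 2 nodes, each holding a 4-element gradient of ones.
out = hierarchical_allreduce([np.ones(4) for _ in range(16)])
assert all(np.allclose(g, 16.0) for g in out)
```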

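The 2D-Torus layout used by Sony (item 2) and, in spirit, by Google's paper (item 3) arranges GPUs in a rows x cols grid: reduce-scatter along each row, allreduce along each column, then allgather along each row, so no single ring spans every device. A toy NumPy version of the three phases, again simulating only the arithmetic rather than the communication:

```python
import numpy as np

def torus_allreduce(grads, rows, cols):
    """Toy 2D-Torus allreduce. GPU (r, c) sits in a rows x cols grid.

    Phase 1: reduce-scatter along each row, so GPU (r, c) owns chunk c
             summed over row r.
    Phase 2: allreduce each chunk along its column, yielding the global
             sum of that chunk.
    Phase 3: allgather along each row so every GPU holds the full result.
    The payload length must be divisible by cols.
    """
    grid = np.array(grads).reshape(rows, cols, -1)    # (r, c, payload)
    chunks = np.array(np.split(grid, cols, axis=2))   # (chunk, r, c, part)
    row_sums = chunks.sum(axis=2)                     # phase 1: (chunk, r, part)
    col_sums = row_sums.sum(axis=1)                   # phase 2: (chunk, part)
    full = col_sums.reshape(-1)                       # phase 3: reassemble
    return [full.copy() for _ in grads]

# A 4 x 4 grid of GPUs, each with an 8-element gradient of ones.
out = torus_allreduce([np.ones(8) for _ in range(16)], rows=4, cols=4)
assert all(np.allclose(g, 16.0) for g in out)
```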
 

Acceleration-related

1. AdamW and Super-convergence is now the fastest way to train neural nets (the decoupled update is sketched after this list)

   Chinese explainer: The fastest way to train neural networks today: the AdamW optimizer plus super-convergence

   Fixing Weight Decay Regularization in Adam

2. SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning (see the perturbation-averaging sketch below)

3. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates

4. Cyclical Learning Rates for Training Neural Networks (a triangular schedule is sketched below)

    Chinese explainer: https://blog.csdn.net/guojingjuan/article/details/53200776

5. MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms
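
The fix behind item 1: in Adam, adding L2 regularization to the gradient gets rescaled by the adaptive denominator, so it is not true weight decay; AdamW applies the decay directly to the weights. A minimal single-tensor NumPy sketch of the decoupled update (hyperparameter names follow Loshchilov & Hutter; the function itself is illustrative):

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step. The lr * weight_decay * w term is applied directly
    to the weights instead of being folded into the gradient, so it is
    NOT divided by the adaptive sqrt(v) denominator."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# One step on a scalar parameter with gradient 0.5.
w, m, v = adamw_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```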

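My reading of SmoothOut (item 2) is that it favors flat minima by averaging the loss, and hence the gradient, over uniformly perturbed copies of the weights. A hedged NumPy sketch of that averaging step (num_copies and radius are illustrative values, not the paper's, and the paper's exact de-noising procedure is omitted):

```python
import numpy as np

def smoothout_gradient(w, grad_fn, num_copies=4, radius=0.01, seed=0):
    """Average gradients over perturbed weight copies w + u with
    u ~ Uniform(-radius, radius), approximating the gradient of a
    locally smoothed loss that suppresses sharp minima."""
    rng = np.random.default_rng(seed)
    grads = [grad_fn(w + rng.uniform(-radius, radius, size=w.shape))
             for _ in range(num_copies)]
    return np.mean(grads, axis=0)

# Example on f(w) = sum(w ** 2), whose exact gradient is 2 * w.
g = smoothout_gradient(np.ones(3), lambda w: 2 * w)
```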
 
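Items 3 and 4 are closely related: cyclical learning rates sweep the LR between a lower and an upper bound every cycle, and super-convergence uses essentially one such cycle with a very large peak LR. A minimal triangular CLR schedule (the formula follows Smith's paper; the function name is mine):

```python
import numpy as np

def triangular_clr(step, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Triangular cyclical learning rate (Smith, 2015). The LR climbs
    linearly from base_lr to max_lr over step_size iterations, descends
    back over the next step_size, and repeats."""
    cycle = np.floor(1 + step / (2 * step_size))
    x = np.abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * np.maximum(0.0, 1 - x)

assert np.isclose(triangular_clr(0), 1e-4)      # cycle start: lower bound
assert np.isclose(triangular_clr(2000), 1e-2)   # mid-cycle: upper bound
```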

PPO-related:

1. Proximal Policy Optimization Algorithms (the clipped surrogate loss is sketched after this list)

2. Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?

3. An Empirical Model of Large-Batch Training
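
For reference, the heart of PPO (item 1) is the clipped surrogate objective L = E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)], where r_t is the probability ratio between the new and old policies. A minimal PyTorch sketch of just that loss (names are mine; a full implementation adds value-function and entropy terms):

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate loss from Schulman et al. (2017).

    ratio = pi_new(a|s) / pi_old(a|s); clipping removes any incentive
    to push the ratio outside [1 - eps, 1 + eps]. Returns a scalar to
    minimize (the negative of the surrogate objective)."""
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy batch: three actions with log-probs under the new and old policies.
loss = ppo_clip_loss(torch.tensor([-1.0, -0.5, -2.0]),
                     torch.tensor([-1.2, -0.6, -1.8]),
                     torch.tensor([1.0, -0.5, 0.3]))
```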

 

Others:

1. Gradient Harmonized Single-stage Detector (a simplified GHM-C weighting is sketched below)

    Chinese explainer: http://tongtianta.site/paper/8075
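
GHM (item 1) observes that most anchors are easy negatives with tiny gradient norms, plus a few outliers with huge ones, and down-weights examples that fall in densely populated gradient-norm bins. A simplified NumPy sketch of the GHM-C weighting, assuming binary classification and omitting the paper's moving average of bin counts:

```python
import numpy as np

def ghm_c_loss(probs, targets, num_bins=10):
    """Simplified GHM-C: weight each example's binary cross-entropy by
    the inverse gradient density around its gradient norm g = |p - y|,
    so bins crowded with easy examples contribute less per example."""
    g = np.abs(probs - targets)                   # gradient norm per example
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    idx = np.clip(np.digitize(g, edges) - 1, 0, num_bins - 1)
    counts = np.bincount(idx, minlength=num_bins)
    density = counts[idx] * num_bins              # GD(g) ~ count / bin width
    weights = len(g) / density                    # beta_i = N / GD(g_i)
    bce = -(targets * np.log(probs + 1e-12) +
            (1 - targets) * np.log(1 - probs + 1e-12))
    return np.mean(weights * bce)

# Mostly easy negatives plus one hard positive: the hard example's bin is
# sparse, so it receives a larger weight.
loss = ghm_c_loss(np.array([0.05, 0.1, 0.08, 0.3]),
                  np.array([0.0, 0.0, 0.0, 1.0]))
```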
