I just spent 2 hours reading the DeepSeek V3 Technical Report; here are my thoughts. First, the paper's contributions come mainly from systems (Training… | Bo Zeng

Engineering Manager, NLP & Machine Learning @ Airbnb AI Labs, Ex Facebook

4 months ago · Edited

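I just spent 2 hours reading the DeepSeek V3 Technical Report; here are my thoughts.

First, the paper's contributions come mainly from systems (training infrastructure) rather than from the model itself, which is still based on the conventional Transformer:
1) They are the first in the world to systematically deploy fp8 (8-bit floating point) quantization in large-scale LLM training, which greatly reduces the GPU memory that training needs and also speeds training up.
2) To use fp8 matrix multiplication correctly, they optimized and improved the way CUDA kernels are invoked, and even gave NVDA a number of design suggestions for Tensor Cores.
3) They developed their own training framework, DualPipe, which implements 16/64-way pipeline and expert (MoE) parallelism, greatly easing the conflict between communication and computation in parallel training and removing the scheduling bottleneck. In the end, DeepSeek trained on a cluster of 2048 H800s.

Second, most of the improvements in the paper are incremental rather than revolutionary:
1) Context extension actually comes from the 2023 paper YaRN. On MTP, DeepSeek V3 ultimately implements only N=1, i.e. it predicts one more token than a conventional GPT.
2) The Aux-Loss-Free Load Balancing technique introduced for MoE really just adds a bias term b_{i} in front of the conventional expert assignment algorithm.
3) Another innovation in DeepSeek MoE is adding "shared experts" and guaranteeing that during training, for each token, these experts are spread over at most 4 nodes, to reduce the communication bottleneck.
4) Its original Multihead Latent Attention essentially uses a linear transformation to project QKV down into a latent space that is stored in the cache, which speeds up storage and helps accelerate inference.
5) Drawing on their experience in quantitative trading, they creatively keep certain moving averages (such as Adam parameter states) on the CPU to reduce parallel overhead, and so on.

Of course, integrating so many new details and ending up with a smooth training framework with almost no loss spikes is nothing short of a miracle.

Finally, DeepSeek did gain extremely valuable experience in RL and distillation. DeepSeek showed that:
1) reasoning ability can be acquired through RL;
2) reasoning ability can be effectively distilled into smaller models.
They also observed, however, that distillation may make the small model's outputs longer and less linguistically efficient. In addition, if the RL reward model is too simple, the model's reasoning may end up confined to math and coding tasks.

Overall, this is indeed a very good paper, showing how low the cost of training a 600B model can go under extreme precision and optimization. But it is not enough to upend Silicon Valley; it is a very good milestone.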
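The bias term b_{i} mentioned in the post can be illustrated with a small sketch. This is not DeepSeek's code. It assumes that the bias is added to each expert's affinity score only when picking the top-k experts, that the gate weights are still computed from the unbiased scores, and that after each step the bias is nudged down for overloaded experts and up for underloaded ones. The function names `route_top_k` and `update_bias`, the step size, and the toy scores are all hypothetical.

```python
def route_top_k(scores, bias, k):
    """Select k experts by (score + bias); weight them by the unbiased scores.

    scores: per-expert affinity for one token (assumed positive).
    bias:   per-expert bias term b_i, used for selection only (assumption).
    """
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i],
                    reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    gates = {i: scores[i] / total for i in chosen}  # normalized gate weights
    return chosen, gates


def update_bias(bias, load, step=0.001):
    """Lower the bias of overloaded experts, raise it for underloaded ones.

    load: number of tokens each expert received in the last step.
    step: hypothetical update speed for the bias.
    """
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


if __name__ == "__main__":
    token_scores = [0.9, 0.5, 0.4, 0.3]  # toy token that prefers expert 0
    # Without bias the token goes to expert 0.
    print(route_top_k(token_scores, [0.0, 0.0, 0.0, 0.0], k=1))
    # A sufficiently negative bias on expert 0 redirects it to expert 1.
    print(route_top_k(token_scores, [-1.0, 0.0, 0.0, 0.0], k=1))
```

Because the bias only changes which experts are selected, no extra loss term is needed to balance the load, which is presumably why the technique is called "aux-loss-free".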

35 comments

Abhinay R. Dubey

AI Developer - Scalong AI| Ex- Wipro | MLOps & AI Enthusiast | Data Science Expert

4 months ago

Great breakdown! The use of fp8 quantization and CUDA kernel optimization is a testament to how hardware-level advancements can significantly reduce training costs and accelerate processes. The DualPipe framework brilliantly addresses bottlenecks with expert and pipeline parallelism. I’m particularly impressed by how RL and distillation were used to transfer reasoning abilities to smaller models, even though challenges in linguistic efficiency remain. These incremental yet impactful optimizations set a strong benchmark for efficient LLM training without being disruptive to Silicon Valley. Excellent analysis! 👏


LiangHao T.

Data Scientist/Engineer/Analyst & Data Lover & Potato's🐱Dad

4 months ago

Awesome

Tianpei Luo

Software Engineer @ Meta Reality Labs (AR/VR)

4 months ago

Sugoi (amazing)


Sepideh Maleki

Postdoctoral Research Fellow @ Genentech | Ph.D. in Computer Science

4 months ago

"most of the improvements in the article are incremental, not revolutionary..." sounds like what reviewer 2 would say! While I agree with many of your points, dismissing the improvements as merely incremental discourages us from pursuing simple approaches, which often turn out to be the most effective. Even in ML conferences, if a paper is written clearly or the method is straightforward or "incremental", instead of celebrating it, we tend to see it as a major flaw.