DeepSeek-V3


1. Introduction

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable: throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.

2. Model Summary


Architecture: Innovative Load Balancing Strategy and Training Objective

  • On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
  • We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding to accelerate inference, as illustrated by the sketch after this list.
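
The acceleration idea can be pictured with a toy speculative-decoding loop: draft tokens (for example, produced by an MTP head) are checked against the main model and kept only up to the first disagreement. The sketch below uses hypothetical `draft_next_tokens` and `target_next_token` callables as stand-ins for the draft head and the main model; it is not DeepSeek-V3 inference code.

```python
# Toy sketch of speculative decoding with draft tokens (e.g., from an MTP head).
# `draft_next_tokens` and `target_next_token` are hypothetical stand-ins, NOT DeepSeek-V3 APIs.
from typing import Callable, List

def speculative_step(
    context: List[int],
    draft_next_tokens: Callable[[List[int], int], List[int]],
    target_next_token: Callable[[List[int]], int],
    k: int = 2,
) -> List[int]:
    """Propose k draft tokens, keep the longest prefix the target model agrees with."""
    draft = draft_next_tokens(context, k)                  # k cheap draft tokens
    accepted: List[int] = []
    for token in draft:
        expected = target_next_token(context + accepted)   # target's own next token
        if token != expected:
            accepted.append(expected)                       # replace the first mismatch
            return accepted
        accepted.append(token)                              # draft token verified, keep it
    # All drafts accepted: append one extra target token "for free".
    accepted.append(target_next_token(context + accepted))
    return accepted

# Tiny usage example with toy "models" over integer tokens:
ctx = [1, 2, 3]
drafts = lambda c, k: [sum(c) % 7, (sum(c) + 1) % 7]   # pretend MTP draft head
target = lambda c: sum(c) % 7                          # pretend main model
print(speculative_step(ctx, drafts, target))           # -> [6, 5]: first draft accepted, second rejected
```

In practice the target model scores all draft positions in a single batched forward pass; the per-token loop above only makes the accept/reject logic explicit.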

Pre-Training: Towards Ultimate Training Efficiency

  • We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model; a toy illustration of the quantization idea follows this list.
  • Through co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, nearly achieving full computation-communication overlap.
    This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
  • At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.
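
As a rough illustration of the FP8 idea (not the actual training framework), the sketch below simulates blockwise quantization of a tensor into an e4m3-style dynamic range with per-block scaling factors. The block size of 128 and the clipping value of 448.0 are illustrative assumptions, and mantissa rounding is deliberately not modeled.

```python
# Toy simulation of blockwise FP8 (e4m3-style) quantization with per-block scales.
# Block size (128) and the e4m3 max value (448.0) are illustrative assumptions,
# not the exact recipe used to train DeepSeek-V3.
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in e4m3

def quantize_blockwise(x: np.ndarray, block: int = 128):
    """Split a 1-D tensor into blocks and scale each block into the FP8 range."""
    pad = (-len(x)) % block
    x_pad = np.pad(x, (0, pad)).reshape(-1, block)
    scales = np.abs(x_pad).max(axis=1, keepdims=True) / E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)        # avoid division by zero on all-zero blocks
    q = np.clip(x_pad / scales, -E4M3_MAX, E4M3_MAX)   # values now fit the FP8 dynamic range
    # Real FP8 also rounds mantissa bits; here we only model the dynamic range.
    return q, scales, pad

def dequantize_blockwise(q: np.ndarray, scales: np.ndarray, pad: int, length: int) -> np.ndarray:
    x = (q * scales).reshape(-1)
    return x[:length] if pad else x

x = np.random.randn(1000).astype(np.float32) * 10
q, s, pad = quantize_blockwise(x)
x_hat = dequantize_blockwise(q, s, pad, len(x))
print("max abs error:", np.abs(x - x_hat).max())  # ~0 here, since mantissa rounding is not modeled
```

The point of the per-block scales is that outliers in one block do not force the whole tensor into a coarse range, which is what makes low-precision formats like FP8 viable for very large models.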

Post-Training: Knowledge Distillation from DeepSeek-R1

  • We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.

3. Model Downloads

| Model | #Total Params | #Activated Params | Context Length | Download |
| :---- | :----: | :----: | :----: | :----: |
| DeepSeek-V3-Base | 671B | 37B | 128K | 🤗 Hugging Face |
| DeepSeek-V3 | 671B | 37B | 128K | 🤗 Hugging Face |

[!NOTE] The total size of DeepSeek-V3 models on Hugging Face is 685B, which includes 671B of the Main Model weights and 14B of the Multi-Token Prediction (MTP) Module weights.

To ensure optimal performance and flexibility, we have partnered with open-source communities and hardware vendors to provide multiple ways to run the model locally. For step-by-step guidance, check out Section 6: How to Run Locally.

For developers looking to dive deeper, we recommend exploring README_WEIGHTS.md for details on the Main Model weights and the Multi-Token Prediction (MTP) Modules. Please note that MTP support is currently under active development within the community, and we welcome your contributions and feedback.
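
For instance, the weights can be fetched programmatically with the `huggingface_hub` library. The snippet below is a minimal sketch; the repository id is an assumption based on the model names in the table above, and the full checkpoint is very large, so make sure sufficient disk space is available.

```python
# Minimal sketch: download the DeepSeek-V3 weights from Hugging Face.
# Requires `pip install huggingface_hub`; the repo id is assumed to match the
# model naming in the table above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V3",   # or "deepseek-ai/DeepSeek-V3-Base"
    local_dir="./DeepSeek-V3",           # where to place the checkpoint files
)
print("Weights downloaded to:", local_path)
```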

4. Evaluation Results

Base Model

Standard Benchmarks
| | Benchmark (Metric) | # Shots | DeepSeek-V2 | Qwen2.5 72B | LLaMA3.1 405B | DeepSeek-V3 |
| :---- | :---- | :----: | :----: | :----: | :----: | :----: |
| | Architecture | - | MoE | Dense | Dense | MoE |
| | # Activated Params | - | 21B | 72B | 405B | 37B |
| | # Total Params | - | 236B | 72B | 405B | 671B |
| English | Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548 |
| | BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5 |
| | MMLU (Acc.) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1 |
| | MMLU-Redux (Acc.) | | | | | |