A Study of the Qwen2.5 Large Language Model

Latest recommended article published on 2025-07-16 20:48:32

Original

CC 4.0 BY-SA license

Article tags:

#deep-learning

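I. Definitions

1. What is the architecture?
2. How are the weights initialized?
3. Understanding tie_word_embeddings
4. What types of language models are there?
5. How do Qwen, Qwen1.5, Qwen2, and Qwen2.5 differ?
6. What are the pros and cons of tokenizer granularity?
7. What characterizes the training data?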

II. Implementation

https://qwen.readthedocs.io/zh-cn/latest/#
1. Qwen2.5 architecture

Type: causal language model
Architecture: a Transformer architecture with RoPE, SwiGLU, RMSNorm, attention QKV bias, and tied word embeddings
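
To make two of the components listed above more concrete, here is a minimal pure-Python sketch of RMSNorm and a SwiGLU feed-forward block. It is not the code from the transformers library, which operates on tensors. The function names (`rms_norm`, `silu`, `matvec`, `swiglu_mlp`) and the default `eps=1e-6` are assumptions chosen for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then scale by a learned weight.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) ), without biases."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


if __name__ == "__main__":
    x = [3.0, 4.0]
    normed = rms_norm(x, weight=[1.0, 1.0])
    print(normed)  # the mean of the squared outputs is close to 1

    # Hypothetical 2 -> 3 -> 2 projection weights, for illustration only.
    w_gate = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    w_down = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
    print(swiglu_mlp(normed, w_gate, w_up, w_down))
```

The gate projection decides, per hidden unit, how much of the up projection passes through, which is what separates SwiGLU from a plain two-layer MLP with a single activation.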

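2. How the weights are initialized
Linear layers are initialized from a normal distribution with mean 0 and standard deviation 0.02 (the initializer_range setting in the model config). Word embeddings are initialized the same way, with mean 0 and standard deviation 0.02.
For models smaller than 7B, tie_word_embeddings = True; for models of 7B and larger, tie_word_embeddings = False.
For a small model such as Qwen2-0.5B-Instruct, the embedding matrix accounts for a large share of the total parameters, so Qwen2 ties the embeddings: the input and output layers share one set of weights, which raises the proportion of non-embedding parameters.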
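
The initialization and weight tying described above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the transformers implementation: `init_normal` and `TinyLM` are hypothetical names, and the vocabulary and hidden sizes are toy values rather than real Qwen configuration values.

```python
import random


def init_normal(rows, cols, std=0.02, rng=None):
    """Return a rows x cols matrix drawn from N(0, std^2).

    Note that 0.02 is the standard deviation, not the variance.
    """
    rng = rng or random.Random(0)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


class TinyLM:
    """Toy model that only holds the input embedding and the output head."""

    def __init__(self, vocab_size, hidden_size, tie_word_embeddings):
        rng = random.Random(0)
        self.embed_tokens = init_normal(vocab_size, hidden_size, rng=rng)
        if tie_word_embeddings:
            # Tied: the output head reuses the very same matrix object.
            self.lm_head = self.embed_tokens
        else:
            # Untied: the output head gets its own, separately initialized matrix.
            self.lm_head = init_normal(vocab_size, hidden_size, rng=rng)

    def embedding_param_count(self):
        """Count parameters in the embedding and head, counting shared weights once."""
        n = len(self.embed_tokens) * len(self.embed_tokens[0])
        return n if self.lm_head is self.embed_tokens else 2 * n


if __name__ == "__main__":
    tied = TinyLM(vocab_size=100, hidden_size=8, tie_word_embeddings=True)
    untied = TinyLM(vocab_size=100, hidden_size=8, tie_word_embeddings=False)
    print(tied.embedding_param_count())    # 800
    print(untied.embedding_param_count())  # 1600
```

Tying halves the parameters spent on the vocabulary matrices. That matters most for small models, where the vocabulary size is large compared with the rest of the network.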

def post_init(self):
    """
    A method executed at the end of each Transformer model initialization, to execute code that needs the model's
    modules properly initialized (such as weight initialization).
    """
    self.init_weights()
    self._backward_compatibility_gradient_checkpointing()
def init_weights(self):
    """
    If needed prunes and maybe