Paper Reading Notes, AI Series: Transformer Model Theory + Practice (Part 3)

Last modified 2024-01-21 09:58:20

License: CC 4.0 BY-SA

Tags: #Artificial Intelligence #Paper Reading #Notes

First published 2024-01-19 21:20:12

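This post takes a close look at the difference between Attention and Self-Attention in the Transformer, at how the model is stacked, and at the role of Positional Encoding. It also covers the Encoder-Decoder design of the Transformer and touches on BLEU for evaluating machine translation, showing the complexity and black-box nature behind the model.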

Third Reading (Close Reading)

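A close reading means working through every detail and leaving no blind spots. The various dimensions and parameters were already covered in "Theory + Practice (Part 2)"; if more questions come up later, I will add them here.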

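3. References (articles and videos)
[1] 【超强动画，一步一步深入浅出解释Transformer原理！】 (video: an animated, step-by-step explanation of how the Transformer works)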

3.1 What is the difference between Attention and Self-Attention?

3.1 References (articles and videos)
[1] What’s the difference between Attention vs Self-Attention? What problems does each other solve that the other can’t?
[2] What’s the Difference Between Self-Attention and Attention in Transformer Architecture?

3.2 How is the Transformer stacked?

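The paper says that the Encoder and the Decoder can each be stacked $N\times$, so what does the structure look like after stacking? Figure 1 shows the stacked structure. Here "features" is the intermediate encoding, that is, the output of the last encoder layer. There are 6 decoder layers, and every one of them takes features as part of its input. This design idea is also loosely similar to ResNet.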

Figure 1: from reference [1]
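To make the data flow in Figure 1 concrete, here is a minimal sketch that only traces which layer receives what. The names `encoder_layer`, `decoder_layer`, `encode` and `decode` are my own placeholders, not from the paper or any library, and the layers do no real computation: each one just appends a label to a list so the wiring can be printed.

```python
# Toy sketch of the stacking wiring only. Real layers contain attention and
# feed-forward sublayers; here each layer just appends a label to a trace.
N = 6  # number of stacked layers, as in the original paper


def encoder_layer(x, i):
    # Stand-in for self-attention + feed-forward.
    return x + [f"enc{i}"]


def decoder_layer(y, memory, i):
    # Stand-in for masked self-attention, attention over `memory`, feed-forward.
    # memory[-1] names the encoder layer that produced the features.
    return y + [f"dec{i}(memory={memory[-1]})"]


def encode(src):
    x = list(src)
    for i in range(1, N + 1):
        x = encoder_layer(x, i)  # output of layer i feeds layer i + 1
    return x  # the "features": output of the last encoder layer


def decode(tgt, memory):
    y = list(tgt)
    for i in range(1, N + 1):
        # The same features are handed to every decoder layer.
        y = decoder_layer(y, memory, i)
    return y


features = encode(["src"])
out = decode(["tgt"], features)
print(features)
print(out)
```

The point of the trace is that the encoder layers form a plain chain, while all six decoder layers read the same final encoder output rather than the output of the encoder layer at the same depth.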

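Looking back at the original Transformer architecture diagram, Outputs is annotated with "(shifted right)". What does that mean? The animation in reference [4] illustrates it: "shifted right" means the model keeps feeding the most recently predicted word back in as part of the Outputs input, so the sequence the decoder reads is the output sequence shifted one position to the right. If you think about it, when you write an article you never write the next word without depending on the previous sentence; you need the preceding text as input before you can write the next word fluently. To some extent, conversation works the same way: you pick up the thread of what was just said, and one remark leads to the next.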

Figure 2: from reference [4]
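The "(shifted right)" idea can be sketched in a few lines. This is only an illustration under my own assumptions: the special tokens `<bos>` and `<eos>`, the helper names `shift_right` and `greedy_decode`, and the lookup function `toy_next_token` that stands in for a trained model are all hypothetical.

```python
BOS, EOS = "<bos>", "<eos>"  # assumed start and end tokens


def shift_right(target):
    """Decoder input during training: prepend BOS and drop the last token."""
    return [BOS] + target[:-1]


def greedy_decode(next_token, max_len=10):
    """Inference: everything generated so far is fed back in as decoder input."""
    outputs = [BOS]
    while len(outputs) < max_len:
        tok = next_token(outputs)  # predict from the prefix generated so far
        outputs.append(tok)
        if tok == EOS:
            break
    return outputs[1:]  # drop the leading BOS


# Hypothetical stand-in for a trained model: it replays a fixed sentence.
SENTENCE = ["I", "am", "a", "student", EOS]


def toy_next_token(prefix):
    return SENTENCE[len(prefix) - 1]


target = ["I", "am", "a", "student", EOS]
decoder_input = shift_right(target)
for seen, predicted in zip(decoder_input, target):
    # At each position the decoder sees the previous token and predicts the next.
    print(f"sees {seen!r:12} -> predicts {predicted!r}")

generated = greedy_decode(toy_next_token)
print(generated)
```

During training the whole shifted sequence is available at once, whereas at inference time the loop has to build it one token at a time, which is the behaviour the animation in reference [4] shows.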

3.2 References (articles and videos)
[1] Transformer’s Encoder-Decoder Let’s Understand The Model Architecture
[2] What is purpose of stacking N=6 blocks of encoder and decoder in transformer?
[3] Stacked encoder and decoder blocks used in Transformers
[4] The Transformer Model - A Step by Step Breakdown of the Transformer’s Encoder-Decoder Architecture