Core Source Code Implementing Luong Attention in TensorFlow 2.0

ErbaoLiu

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/L_15156024189/article/details/105673546

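This article walks through Luong Attention, the attention mechanism based on vector dot products, and explains how the Attention class under tf.keras.layers implements it: its parameters, how it is called, and its internal computation, with particular attention to score calculation, score scaling, mask handling, and the class's initializer and call method.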

Please first read the companion article on the underlying theory, "Intelligent Chat System: Attention Mechanism".

Luong-style attention

https://tensorflow.google.cn/api_docs/python/tf/keras/layers/Attention

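The Attention class under tf.keras.layers implements Luong-style attention. The API documentation describes it as a dot-product attention layer, that is, an attention mechanism based on vector dot products, of which Luong attention is one variant. The class signature is: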

tf.keras.layers.Attention(
    use_scale=False, **kwargs
)

It can be called as follows:

query_value_attention_seq = tf.keras.layers.Attention()(
    [query_seq_encoding, value_seq_encoding])

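Calling the Attention layer takes two arguments:

(1) inputs

inputs is a list of two or three tensors, in the order [query, value, key]:

query: shape [batch_size, Tq, dim]

value: shape [batch_size, Tv, dim]

key: shape [batch_size, Tv, dim]

key is optional. If it is not given, value is used as key.

(2) mask

mask is a list of two boolean tensors:

query_mask: shape [batch_size, Tq], each entry True or False

value_mask: shape [batch_size, Tv], each entry True or False

Output: a tensor of shape [batch_size, Tq, dim]

Now let's read the source against the theory of Luong Attention.

First, the initializer of the Attention class: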

  def __init__(self, use_scale=False, **kwargs):
    super(Attention, self).__init__(**kwargs)
    self.use_scale = use_scale

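The first initializer argument is use_scale, which is True or False. It controls whether the computed scores are multiplied by a learnable scaling weight: if True the scores are scaled, if False they are not. The default is False, and the scaling itself is covered later in this article. The default is what is normally used, so the layer is usually created like this: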

tf.keras.layers.Attention()

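How is the score computed? Before looking at that, we need to be clear about what the three inputs query, key, and value passed to the Attention layer actually are. The specifics depend on the application, but typically the query is held fixed, its dot product is taken with each key, and the results are normalized and used as the weights of the corresponding values. For example:

Suppose

query: [1,1,1]

key: [2,2,2], [3,3,3]

value: [4,4,4], [5,5,5]

As stated above, their shapes are:

[batch_size, Tq, dim]

[batch_size, Tv, dim]

In this example, batch_size=1, Tq=1, Tv=2, dim=3. First take the dot product of the query with each key to get the scores: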

(6,9)
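The dot products above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic for the single-query example, not the TensorFlow implementation; the helper name `dot` is introduced here for illustration.

```python
# Plain-Python check of the worked example (no TensorFlow needed).
query = [1, 1, 1]
keys = [[2, 2, 2], [3, 3, 3]]


def dot(a, b):
    """Dot product of two equal-length vectors (illustrative helper)."""
    return sum(x * y for x, y in zip(a, b))


# The query is held fixed and compared with every key: one score per key.
scores = [dot(query, k) for k in keys]
print(scores)  # [6, 9]
```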

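The corresponding formula in the theory is the dot-product score:

$$\mathrm{score}(\mathbf{h}_t, \mathbf{\bar{h}}_i) = \mathbf{h}_t^\top \mathbf{\bar{h}}_i$$

So query corresponds to the decoder hidden state $\mathbf{h}_t$, and key corresponds to the encoder hidden state $\mathbf{\bar{h}}_i$. The source implementation is: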

def _calculate_scores(self, query, key):
    """Calculates attention scores as a query-key dot product.
    Args:
      query: Query tensor of shape `[batch_size, Tq, dim]`.
      key: Key tensor of shape `[batch_size, Tv, dim]`.
    Returns:
      Tensor of shape `[batch_size, Tq, Tv]`.
    """
    scores = math_ops.matmul(query, key, transpose_b=True)
    if self.scale is not None:
      scores *= self.scale
    return scores

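The relevant line is scores = math_ops.matmul(query, key, transpose_b=True).

transpose_b=True transposes key. This differs slightly from the formula, where it is the query $\mathbf{h}_t$ that is transposed, but only because the formula treats vectors as columns while the code stores them as rows; the result is the same. As for self.scale, it is assigned in the build method of Attention: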

 def build(self, input_shape):
    """Creates scale variable if use_scale==True."""
    if self.use_scale:
      self.scale = self.add_weight(
          name='scale',
          shape=(),
          initializer=init_ops.ones_initializer(),
          dtype=self.dtype,
          trainable=True)
    else:
      self.scale = None
    super(Attention, self).build(input_shape)

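In other words, if the Attention layer is created with use_scale=True, self.scale is a trainable scalar weight initialized to 1, and the scores are multiplied by it after they are computed. Otherwise self.scale is None and the scores are left unchanged.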
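The behaviour of _calculate_scores, including the optional scale, can be sketched with nested Python lists. This is a minimal illustration under stated assumptions, not the library code: the function name `calculate_scores` and the scale value 0.5 are hypothetical (in the layer the scale starts at 1 and is learned during training).

```python
def calculate_scores(query, key, scale=None):
    """Batched query-key dot products on nested lists.

    query: shape [batch_size, Tq, dim]
    key:   shape [batch_size, Tv, dim]
    Returns a nested list of shape [batch_size, Tq, Tv].
    """
    scores = [
        [[sum(qi * ki for qi, ki in zip(q_vec, k_vec)) for k_vec in k_batch]
         for q_vec in q_batch]
        for q_batch, k_batch in zip(query, key)
    ]
    if scale is not None:
        # Mirrors `scores *= self.scale` when use_scale=True.
        scores = [[[s * scale for s in row] for row in batch]
                  for batch in scores]
    return scores


query = [[[1.0, 1.0, 1.0]]]                  # shape [1, 1, 3]
key = [[[2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]]   # shape [1, 2, 3]

unscaled = calculate_scores(query, key)           # like use_scale=False
scaled = calculate_scores(query, key, scale=0.5)  # hypothetical learned scale
print(unscaled)  # [[[6.0, 9.0]]]
print(scaled)    # [[[3.0, 4.5]]]
```

Each output row holds the scores of one query position against all Tv key positions, which is why the result has shape [batch_size, Tq, Tv].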

As mentioned above, the Attention layer is used by calling it like this:

query_value_attention_seq = tf.keras.layers.Attention()(
    [query_seq_encoding, value_seq_encoding])

This actually invokes a method named call(). The Attention class itself does not define that method, but it inherits from the following parent class:

@keras_export('keras.layers.Attention')
class Attention(BaseDenseAttention):

The call method is defined in the parent class BaseDenseAttention:

def call(self, inputs, mask=None):
    self._validate_call_args(inputs=inputs, mask=mask)
    q = inputs[0]
    v = inputs[1]
    k = inputs[2] if len(inputs) > 2 else v
    q_mask = mask[0] if mask else None
    v_mask = mask[1] if mask else None
    scores = self._calculate_scores(query=q, key=k)
    if v_mask is not None:
      # Mask of shape [batch_size, 1, Tv].
      v_mask = array_ops.expand_dims(v_mask, axis=-2)
    if self.causal:
      # Creates a lower triangular mask, so position i cannot attend to
      # positions j>i. This prevents the flow of information from the future
      # into the past.
      scores_shape = array_ops.shape(scores)
      # causal_mask_shape = [1, Tq, Tv].
      causal_mask_shape = array_ops.concat(
          [array_ops.ones_like(scores_shape[:-2]), scores_shape[-2:]],
          axis=0)
      causal_mask = _lower_triangular_mask(causal_mask_shape)
    else:
      causal_mask = None
    scores_mask = _merge_masks(v_mask, causal_mask)
    result = self._apply_scores(scores=scores, value=v, scores_mask=scores_mask)
    if q_mask is not None:
      # Mask of shape [batch_size, Tq, 1].
      q_mask = array_ops.expand_dims(q_mask, axis=-1)
      result *= math_ops.cast(q_mask, dtype=result.dtype)
    return result
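The mask handling in call can be sketched in plain Python for a single example without the batch axis. The source of the helpers _lower_triangular_mask and _merge_masks is not shown in this article, so the versions below are assumptions inferred from the comments in call: a lower-triangular boolean mask, and a logical AND of whichever masks are present.

```python
def lower_triangular_mask(tq, tv):
    """mask[i][j] is True when query position i may attend to position j <= i."""
    return [[j <= i for j in range(tv)] for i in range(tq)]


def merge_masks(value_mask, causal_mask):
    """Combine a [Tv] value mask with a [Tq, Tv] causal mask (assumed: logical AND).

    Either mask may be None, in which case the other is returned unchanged.
    """
    if value_mask is None:
        return causal_mask
    if causal_mask is None:
        return [list(value_mask)]  # shape [1, Tv], applies to every query row
    return [[c and v for c, v in zip(row, value_mask)] for row in causal_mask]


causal = lower_triangular_mask(3, 3)
# Last value position is padding, so no query may attend to it.
merged = merge_masks([True, True, False], causal)
for row in merged:
    print(row)
```

In the merged mask a position is kept only if it is both a real (unpadded) value position and not in the future of the query position.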

In call, the line self._validate_call_args(inputs=inputs, mask=mask) validates the input arguments. We skip the details and continue:

q = inputs[0]
v = inputs[1]
k = inputs[2] if len(inputs) > 2 else v

These three lines read the query q, the value v, and the key k from the inputs list. If no key was given, k takes the value of v.

q_mask = mask[0] if mask else None
v_mask = mask[1] if mask else None

These two lines read the query mask and the value mask.

scores = self._calculate_scores(query=q, key=k)

This line computes the scores, as explained earlier. We ignore the mask-related code for now.

result = self._apply_scores(scores=scores, value=v, scores_mask=scores_mask)

This line calls the _apply_scores method of the parent class BaseDenseAttention:

  def _apply_scores(self, scores, value, scores_mask=None):
    if scores_mask is not None:
      padding_mask = math_ops.logical_not(scores_mask)
      # Bias so padding positions do not contribute to attention distribution.
      scores -= 1.e9 * math_ops.cast(padding_mask, dtype=K.floatx())
    attention_distribution = nn.softmax(scores)
    return math_ops.matmul(attention_distribution, value)

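This method normalizes the scores and then uses them as weights to form a weighted sum of value. The formulas are:

$$a_t(i) = \frac{\exp\left(\mathrm{score}(\mathbf{h}_t, \mathbf{\bar{h}}_i)\right)}{\sum_{i'} \exp\left(\mathrm{score}(\mathbf{h}_t, \mathbf{\bar{h}}_{i'})\right)}$$

$$\mathbf{c}_t = \sum_i a_t(i)\, \mathbf{\bar{h}}_i$$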

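The normalization corresponds to: attention_distribution = nn.softmax(scores)

The weighted sum over value corresponds to: math_ops.matmul(attention_distribution, value)

From the code we can see that in Luong Attention value corresponds to the encoder hidden state $\mathbf{\bar{h}}_i$. This also explains why, when the Attention layer is called without a key, key=value: key and value are the same thing, namely the encoder hidden state $\mathbf{\bar{h}}_i$.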
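The normalization and weighting steps can be checked on the earlier example (scores 6 and 9, values [4,4,4] and [5,5,5]). The sketch below handles a single query in plain Python and is not the library code; the names `softmax` and `apply_scores` are introduced here for illustration. It also shows the effect of scores_mask, where masked-out positions receive a bias of -1e9 as in the source.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def apply_scores(scores, values, scores_mask=None):
    """Single-query sketch: scores has shape [Tv], values has shape [Tv, dim]."""
    if scores_mask is not None:
        # Positions with mask False get a large negative bias so that
        # softmax assigns them (numerically) zero weight.
        scores = [s - 1.0e9 * (0.0 if keep else 1.0)
                  for s, keep in zip(scores, scores_mask)]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(dim)]
    return weights, context


scores = [6.0, 9.0]
values = [[4.0, 4.0, 4.0], [5.0, 5.0, 5.0]]

weights, context = apply_scores(scores, values)
masked_weights, masked_context = apply_scores(
    scores, values, scores_mask=[True, False])
print(weights, context)
print(masked_weights, masked_context)
```

Without a mask the weights are roughly 0.047 and 0.953, so the output is close to the second value. With the second position masked out, all the weight falls on the first value.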

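Now let's go back to the role of the mask argument. The theory article presented two variants of Luong Attention, Global Attention and Local Attention. If mask is omitted, every position is taken into account, which corresponds to Global Attention. If a mask is given, only the unmasked positions take part, so attention is restricted to a subset of positions, which loosely resembles Local Attention; in practice the mask is mostly used to exclude padding positions. The API documentation explains mask as follows: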