[PyTorch] Attention in CNNs

Original article

Published 2024-02-01 21:45:07

Licensed under CC 4.0 BY-SA

Tags: #pytorch #cnn #attention

Attention in a Broader Sense

How to tell channel-wise attention from spatial-wise attention

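To understand the suffix “-wise”, read it as emphasizing one particular dimension or aspect. For example:
“time-wise” means “with respect to time”.
Used this way, “-wise” makes explicit which dimension or aspect is under discussion. So “channel-wise” attention operates on a per-channel basis, while “spatial-wise” attention operates on spatial positions or regions.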

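In neural networks, especially when processing image or video data, an attention mechanism can focus on different parts of the input in different ways. Channel-wise and spatial-wise attention are two common forms, explained below.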

Channel-wise Attention

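Meaning: in “channel-wise” attention, the attention operation is applied across channels. In image processing, a channel usually means a color channel (such as red, green, and blue in RGB); in a deep learning model, a channel is one feature map in a layer's output, that is, one learned feature representation.
Example: when a model processes a multi-channel feature map, channel-wise attention assigns one weight per channel, deciding which channels matter more. It strengthens the features of some channels and suppresses those of others.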

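Application: this type of attention is especially useful when different channels carry different semantic information. In a convolutional neural network, for example, the different filters of a convolutional layer learn channels that respond to different features (such as edges or textures).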
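The idea of one weight per channel can be sketched in PyTorch as follows. This is a minimal squeeze-and-excitation style module, not something taken from the article; the class name `ChannelAttention` and the `reduction` parameter are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """SE-style channel attention sketch: one learned weight per channel."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Squeeze each channel to a single number: (B, C, H, W) -> (B, C, 1, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Small bottleneck MLP that maps channel statistics to channel weights
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # weights lie in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.mlp(self.pool(x).view(b, c))  # (B, C)
        # Broadcast each channel weight over all H x W positions
        return x * weights.view(b, c, 1, 1)


x = torch.randn(2, 16, 8, 8)
out = ChannelAttention(16)(x)
print(out.shape)  # torch.Size([2, 16, 8, 8])
```

Every spatial position within a channel is scaled by the same factor, which is what makes the attention channel-wise.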

Spatial-wise Attention

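Meaning: in “spatial-wise” attention, the attention operation is applied across the spatial regions of an image or feature map. It weights individual positions rather than treating the whole image uniformly.
Example: in an object detection task, spatial-wise attention may concentrate on the regions of the image that contain important objects and ignore irrelevant regions.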

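Application: this type of attention is especially useful in tasks such as image recognition and object detection, because it helps the model focus on the most important parts of the image, for example when an object occupies only a small part of it.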
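A spatial attention map assigns one weight per position, shared by all channels. The sketch below follows the common CBAM-style recipe (channel-wise mean and max, then a convolution); it is an illustrative assumption, not the article's own code, and the class name `SpatialAttention` and the default `kernel_size` are hypothetical choices.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention sketch: one weight per (h, w) position."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # 2 input maps (channel-wise mean and max) -> 1 attention map
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)  # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)  # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        # Broadcast the single-channel map over all C channels
        return x * attn


x = torch.randn(2, 16, 8, 8)
out = SpatialAttention()(x)
print(out.shape)  # torch.Size([2, 16, 8, 8])
```

Here all channels at a given position are scaled by the same factor, the mirror image of the channel-wise case.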

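How to Choose

Whether to use channel-wise or spatial-wise attention depends on the task and on the characteristics of the data. The two can also be combined to exploit channel and spatial information at the same time. For example, CBAM (Convolutional Block Attention Module) applies channel attention and then spatial attention in sequence, which improves the model's understanding of the image.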
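Combining the two can be as simple as applying them one after the other. The following is a compact sketch under that assumption (channel first, then spatial); the class name `ChannelSpatialAttention` and its hyperparameters are hypothetical and do not reproduce any specific published module exactly.

```python
import torch
import torch.nn as nn


class ChannelSpatialAttention(nn.Module):
    """Sketch: channel attention followed by spatial attention."""

    def __init__(self, channels: int, reduction: int = 4, kernel_size: int = 7):
        super().__init__()
        # 1x1 convolutions act as a per-sample MLP over the channel dimension
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Step 1: channel weights of shape (B, C, 1, 1)
        pooled = x.mean(dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.channel_mlp(pooled))
        # Step 2: spatial weights of shape (B, 1, H, W), computed on the
        # channel-attended features
        stats = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * torch.sigmoid(self.spatial_conv(stats))


x = torch.randn(2, 16, 8, 8)
out = ChannelSpatialAttention(16)(x)
print(out.shape)  # torch.Size([2, 16, 8, 8])
```

The order is a design choice: computing the spatial map on already channel-weighted features lets it ignore channels that were judged uninformative.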

An Example of Spatial-Channel Attention

The example is the AGSCA module from “Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization”. The figure and the English text are taken from the original paper.

Given audio features $\boldsymbol{a}_t \in \mathbb{R}^{d_a}$ and visual features $v_t \in \mathbb{R}^{d_v \times (H * W)}$ where $H$ and $W$ are the height and width of feature maps respectively, AGSCA first generates channel-wise attention maps $\boldsymbol{M}_t^c \in \mathbb{R}^{d_v \times 1}$ to adaptively emphasize informative features. It then produces spatial attention maps $\boldsymbol{M}_t^s \in \mathbb{R}^{1 \times (H * W)}$ for the channel-attentive features to highlight sounding regions, yielding channel-spatial attentive visual features $v_t^{cs}$, as illustrated in Figure 3. The attention process can be summarized as,

$$\begin{aligned} v_t^{cs} &= \boldsymbol{M}_t^s \otimes \left(v_t^c\right)^T, \\ v_t^c &= \boldsymbol{M}_t^c \odot v_t, \end{aligned}$$

where $\otimes$ denotes matrix multiplication, and $\odot$ means element-wise multiplication. We next separately introduce the channel-wise attention that generates attention maps $\boldsymbol{M}_t^c$ and spatial attention that produces attention maps $\boldsymbol{M}_t^s$.
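To make the shapes in the two equations concrete, here is a small tensor sketch. It only illustrates the final combination step: the attention maps `M_c` and `M_s` are random placeholders (in the paper they are computed from the audio and visual features), and normalizing them with sigmoid and softmax is an assumption made for this illustration.

```python
import torch

torch.manual_seed(0)
d_v, H, W = 8, 4, 4

v_t = torch.randn(d_v, H * W)  # visual features, shape (d_v, H*W)

# Placeholder attention maps (assumption: random values, NOT the paper's
# audio-guided computation)
M_c = torch.sigmoid(torch.randn(d_v, 1))            # channel map, (d_v, 1)
M_s = torch.softmax(torch.randn(1, H * W), dim=-1)  # spatial map, (1, H*W)

# v_t^c = M_t^c ⊙ v_t : element-wise product, broadcast over positions
v_c = M_c * v_t    # (d_v, H*W)

# v_t^{cs} = M_t^s ⊗ (v_t^c)^T : matrix product, a weighted sum over positions
v_cs = M_s @ v_c.T  # (1, H*W) @ (H*W, d_v) -> (1, d_v)

print(v_c.shape, v_cs.shape)  # torch.Size([8, 16]) torch.Size([1, 8])
```

The matrix product collapses the spatial dimension, so the result is a single $d_v$-dimensional vector per time step: a spatially weighted average of the channel-attended features.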