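[Notes] In a Transformer, LayerNorm normalizes over all feature dimensions of a single patch (token) within a single sample, whereas LayerNorm as traditionally described normalizes over all channels of a single sample in the batch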

Last modified 2024-08-14 10:22:31

License: CC 4.0 BY-SA

Tags: #notes #transformer #batch

First published 2024-08-02 12:10:07

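Note: see also the companion post

[Notes] The difference between BatchNorm, LayerNorm, and InstanceNorm: BatchNorm normalizes each channel across the mini-batch; LayerNorm normalizes all channels of each sample; InstanceNorm normalizes each channel of each sample independently (CSDN blog)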
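The three reduction patterns named above can be checked numerically. The following is a minimal sketch on a random `(N, C, H, W)` tensor; the tensor sizes and the helper `normalize` are made up for illustration, and the affine parameters are switched off so that only the choice of axes matters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3, 5, 5) * 2.0 + 1.0  # (N, C, H, W), arbitrary demo sizes


def normalize(t, dims, eps):
    """Hypothetical helper: standardize t over the given axes (biased variance)."""
    mean = t.mean(dim=dims, keepdim=True)
    var = t.var(dim=dims, unbiased=False, keepdim=True)
    return (t - mean) / torch.sqrt(var + eps)


# BatchNorm: one mean/variance per channel, computed over (N, H, W).
bn = nn.BatchNorm2d(3, affine=False).train()
bn_ok = torch.allclose(bn(x), normalize(x, (0, 2, 3), bn.eps), atol=1e-5)

# LayerNorm (CNN-style): one mean/variance per sample, computed over (C, H, W).
ln = nn.LayerNorm([3, 5, 5], elementwise_affine=False)
ln_ok = torch.allclose(ln(x), normalize(x, (1, 2, 3), ln.eps), atol=1e-5)

# InstanceNorm: one mean/variance per (sample, channel), computed over (H, W).
inorm = nn.InstanceNorm2d(3, affine=False)
in_ok = torch.allclose(inorm(x), normalize(x, (2, 3), inorm.eps), atol=1e-5)

print(bn_ok, ln_ok, in_ok)
```

Each flag compares the PyTorch module against a manual standardization over the stated axes, so a mismatch would indicate that the axes were chosen incorrectly.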

The `norm_layer` argument we pass in is nn.LayerNorm.

Let's inspect the resulting LayerNorm module:

self.norm = norm_layer(embed_dim)
print(self.norm)

LayerNorm((768,), eps=1e-06, elementwise_affine=True)
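For reference, a module with these settings can be built as in the sketch below. It assumes that `eps=1e-6` is fixed through `functools.partial`, as the rwightman reference implementation does; the excerpt of the listing in this post does not show that line, so treat it as an assumption. `embed_dim = 768` is taken from the printout.

```python
from functools import partial

import torch.nn as nn

embed_dim = 768  # embedding size, taken from the printout above

# Assumption: eps is bound with partial, so norm_layer(embed_dim) needs one argument.
norm_layer = partial(nn.LayerNorm, eps=1e-6)
norm = norm_layer(embed_dim)

print(norm)                               # module summary
print(norm.weight.shape, norm.bias.shape)  # learnable scale and shift, one per feature
```

Because `normalized_shape` is `(768,)`, the statistics are computed over the last dimension only, and the learnable scale and shift each hold 768 values.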

Key output:

training head.bias
  0%|          | 0/368 [00:00<?, ?it/s]
tensor([ 6.4645, -4.7138,  4.6655, -5.7235, -2.1071, -3.0778, -3.2277,  1.0825,
        -3.1296,  6.2348], device='cuda:0')
tensor(-0.0354, device='cuda:0') tensor(6.8029, device='cuda:0')
tensor([ 0.1896, -0.1529,  0.1320, -0.1904, -0.0908, -0.1010, -0.1128,  0.0217,
        -0.1351,  0.2111], device='cuda:0')
z= tensor([ 0.1895, -0.1528,  0.1319, -0.1902, -0.0907, -0.1009, -0.1128,  0.0217,
        -0.1350,  0.2109], device='cuda:0')
torch.Size([8, 197, 768])
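The log above compares the LayerNorm output with a manually computed `z` for one token. A minimal sketch of such a check follows. It uses a random input of the same shape `[8, 197, 768]` rather than the real ViT activations, so the printed numbers will differ from the log; the variable names are chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, N, D = 8, 197, 768                 # batch, tokens (patches + cls), embedding dim
x = torch.randn(B, N, D) * 3.0 + 0.5  # stand-in for the real activations
norm = nn.LayerNorm(D, eps=1e-6)

with torch.no_grad():
    y = norm(x)

    # Manual LayerNorm: statistics over the last dimension only,
    # i.e. one mean/variance per (sample, token) pair.
    mean = x.mean(dim=-1, keepdim=True)                 # shape (8, 197, 1)
    var = x.var(dim=-1, unbiased=False, keepdim=True)   # biased variance
    z = (x - mean) / torch.sqrt(var + norm.eps) * norm.weight + norm.bias

    # Perturb a different token and a different sample: token (0, 0) is unaffected.
    x2 = x.clone()
    x2[0, 1] += 100.0
    x2[1] -= 50.0
    y2 = norm(x2)

max_diff = (y - z).abs().max().item()
token_unchanged = torch.allclose(y[0, 0], y2[0, 0])

print(y[0, 0, :10])
print("z=", z[0, 0, :10])
print(y.shape, max_diff, token_unchanged)
```

The second check illustrates the point of the title: the output for one patch depends only on the 768 features of that patch, not on other patches or other samples in the batch.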

Full Code:

"""
original code from rwightman:
https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py
"""
from functools import partial
from collections import OrderedDict

import torch
import torch.nn as nn


def drop_path(x, drop_prob: float = 0., training: bool = False):
    """
    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
    This is the same as the DropConnect impl I created for EfficientNet, etc networks, however,
    the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for
    changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use
    'survival rate' as the argument.
    """
    if drop_prob == 0. or not training:  # nothing to drop, or not in training mode
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)  # torch.rand() samples from [0, 1)
    random_tensor.floor_()  # binarize: floor_() rounds each element down in place, giving 0 or 1
    output = x.div(keep_prob) * random_tensor
    return output

        # x = torch.tensor([[1.0, 2.0],
        #                   [3.0, 4.0],
        #                   [5.0, 6.0],
        #                   [7.0, 8.0]])
        # keep_prob = 1 - drop_prob
        # keep_prob = 1 - 0.5
        # keep_prob = 0.5
        # shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        # shape = (4,) + (1,) * (2 - 1)
        # shape = (4, 1)
        #
        # random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
        # random_tensor = 0.5 + torch.rand((4, 1))
        #
        # # The generated random_tensor lies in [0.5, 1.5) and might look like
        # random_tensor = torch.tensor([[1.3],
        #                               [0.8],
        #                               [1.2],
        #                               [0.6]])
        # random_tensor.floor_()
        # random_tensor = torch.tensor([[1.0],
        #                               [0.0],
        #                               [1.0],
        #                               [0.0]])
        # output = x.div(keep_prob) * random_tensor
        # output = x.div(0.5) * random_tensor
        #
        # # Compute each element
        # output = torch.tensor([[1.0 / 0.5, 2.0 / 0.5],
        #                        [3.0 / 0.5, 4.0 / 0.5],
        #                        [5.0 / 0.5, 6.0 / 0.5],
        #                        [7.0 / 0.5, 8.0 / 0.5]]) * random_tensor
        #
        # output = torch.tensor([[2.0, 4.0],
        #                        [6.0, 8.0],
        #                        [10.0, 12.0],
        #                        [14.0, 16.0]]) * random_tensor
        #
        # # Result
        # output = torch.tensor([[2.0, 4.0],
        #                        [0.0, 0.0],
        #                        [10.0, 12.0],
        #                        [0.0, 0.0]])

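# In this example, the path of each sample is dropped at random.
# For a dropped path, the corresponding elements of the output tensor are set to 0; for a kept path,
# the values are scaled up by 1 / keep_prob so that the overall expected value stays unchanged.
#
# The effect is similar to Dropout, but an entire path (for example a residual branch) is dropped
# at random rather than individual neurons.
# This technique can improve the robustness and generalization of the model.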


class DropPath(nn.Module):
    """
    Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    """
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward