[Video Understanding] Notes on TSN (Temporal Segment Networks: Towards Good Practices for Deep Action Recognition)

鹿鹿最可爱

Licensed under CC 4.0 BY-SA

Category: Video Algorithms. Tags: TSN, video understanding

Article link: https://blog.csdn.net/qq_31622015/article/details/109130246

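This article introduces Temporal Segment Networks (TSN), a method for deep action recognition that uses a sparse sampling strategy to capture long-range temporal dependencies. TSN performs strongly on ActivityNet, HMDB51, and UCF101. Fusing modalities such as RGB and optical flow, together with specific data augmentation strategies such as random cropping and flipping, improves model performance. The article also covers the role of pre-training and partial BN in optimizing the network.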

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

https://arxiv.org/abs/1608.00859
https://github.com/yjxiong/temporal-segment-networks
https://github.com/yjxiong/tsn-pytorch

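Temporal Segment Networks (TSN, two-stream)
TSN combines a sparse temporal sampling strategy with video-level supervision, so learning can efficiently use information from the whole action video rather than from a single snippet. Testing follows the standard protocol used by mainstream methods.
Winner of the ActivityNet 2016 challenge (93.2% mAP); 69.4% on HMDB51 and 94.2% on UCF101.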

Paper

Method

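Long-range temporal dependencies in video

Handling the motion information in a video, and finding a way to fuse appearance and motion information, is the key to video understanding tasks.

The two mainstream approaches to action recognition are 3D convolution and two-stream networks, but both capture only short-range temporal dependencies. To capture long-range dependencies, these methods usually have to sample video clips densely. (In temporal action localization, for example, the video is split into frames and multi-scale sliding windows are applied: with a window of 64, every 64 frames form one clip, so the video becomes a set of clips.)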
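As a rough illustration of the dense clip sampling mentioned above, the sketch below enumerates sliding-window clips over a frame sequence. The function name `sliding_window_clips` and the stride values are assumptions made for illustration; only the window length of 64 comes from the text.

```python
def sliding_window_clips(num_frames, window=64, stride=64):
    """Return (start, end) frame index pairs, end exclusive, for each full clip."""
    clips = []
    start = 0
    while start + window <= num_frames:
        clips.append((start, start + window))
        start += stride
    return clips


# A 300-frame video cut into non-overlapping 64-frame clips.
clips = sliding_window_clips(num_frames=300, window=64, stride=64)
print(len(clips), clips[0], clips[-1])
```

A smaller stride produces overlapping clips and many more forward passes, which is the cost that sparse sampling tries to avoid.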

TSN instead samples sparsely while still using information from the whole video, since adjacent frames are largely redundant.

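TSN divides a video into 3 segments and, within each segment, samples one snippet uniformly at random. The two-stream network produces class scores for each snippet (the values before softmax), the scores of the different snippets are averaged, and the average is passed through softmax. In the architecture figure, the K spatial ConvNets share parameters, and so do the K temporal ConvNets.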
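The sampling and averaging steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names `sample_snippet_indices` and `segmental_consensus` are hypothetical, and the scores are stand-in numbers rather than ConvNet outputs.

```python
import math
import random


def sample_snippet_indices(num_frames, num_segments=3, rng=random):
    """Pick one random frame index from each of num_segments equal-length segments."""
    seg_len = num_frames // num_segments
    if seg_len == 0:
        raise ValueError("video is shorter than the number of segments")
    return [k * seg_len + rng.randrange(seg_len) for k in range(num_segments)]


def segmental_consensus(snippet_scores):
    """Average per-snippet class scores (pre-softmax), then apply softmax."""
    num_classes = len(snippet_scores[0])
    mean = [sum(s[c] for s in snippet_scores) / len(snippet_scores)
            for c in range(num_classes)]
    peak = max(mean)  # subtract the max for numerical stability
    exps = [math.exp(m - peak) for m in mean]
    total = sum(exps)
    return [e / total for e in exps]


indices = sample_snippet_indices(num_frames=90, num_segments=3)
# Stand-in scores; in TSN these would come from the shared ConvNet.
scores = [[2.0, 0.5, 0.1], [1.5, 0.7, 0.3], [2.5, 0.2, 0.2]]
probs = segmental_consensus(scores)
print(indices, probs)
```

Because the consensus is taken before softmax, one video-level loss can be back-propagated through all snippets at once.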

Details

Input

RGB
A single frame of the video
RGB difference
The difference between two adjacent frames, which can be used to represent motion information
optical flow
warped optical flow
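Of the four inputs listed above, RGB difference is the simplest to compute directly. A minimal sketch with NumPy follows, assuming frames are stacked as a `(T, H, W, 3)` array; the function name `rgb_difference` and the toy clip are illustrative assumptions, not part of the original code.

```python
import numpy as np


def rgb_difference(frames):
    """Stack differences between consecutive frames.

    frames: array of shape (T, H, W, 3); returns shape (T - 1, H, W, 3).
    """
    frames = np.asarray(frames, dtype=np.float32)  # avoid uint8 wrap-around
    return frames[1:] - frames[:-1]


# Toy clip: 6 random 4x4 RGB frames standing in for decoded video frames.
clip = np.random.randint(0, 256, size=(6, 4, 4, 3), dtype=np.uint8)
diff = rgb_difference(clip)
print(diff.shape)
```

Casting to float before subtracting matters: with unsigned 8-bit frames, a negative difference would otherwise wrap around to a large positive value.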

Modality

RGB/optical flow
1:1.5
RGB/optical flow/warped optical flow
1:1:0.5
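The ratios above are the weights used when combining per-modality predictions. The sketch below shows one way such a weighted average could be computed; the function name `fuse_scores`, the dictionary layout, and the example scores are assumptions for illustration, and the normalization by the weight sum is a convenience that does not change the predicted class.

```python
def fuse_scores(scores_by_modality, weights):
    """Weighted average of per-class scores from several modalities."""
    total_weight = sum(weights[m] for m in scores_by_modality)
    num_classes = len(next(iter(scores_by_modality.values())))
    fused = [0.0] * num_classes
    for modality, scores in scores_by_modality.items():
        for c, s in enumerate(scores):
            fused[c] += weights[modality] * s / total_weight
    return fused


# RGB : optical flow = 1 : 1.5, as in the list above.
weights = {"rgb": 1.0, "flow": 1.5}
scores = {"rgb": [0.6, 0.4], "flow": [0.2, 0.8]}
fused = fuse_scores(scores, weights)
print(fused)
```

With three modalities, the same function would take `{"rgb": 1.0, "flow": 1.0, "warped_flow": 0.5}` as weights.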

Data augmentation

Random cropping
Horizontal flipping
Corner cropping
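Crops are taken from the four corners plus the center, which keeps the network from attending only to the center of the frame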
Scale and ratio jittering
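Corner cropping and scale jittering from the list above can be sketched together. The 256x340 input size and the candidate crop sizes {256, 224, 192, 168} are the values I recall from the paper and should be checked against it; the helper names `corner_crop_offsets` and `sample_crop` are hypothetical. In a full pipeline the chosen region would then be resized to the network input size.

```python
import random

CROP_SIZES = (256, 224, 192, 168)  # candidate crop widths/heights (assumed from the paper)


def corner_crop_offsets(img_w, img_h, crop_w, crop_h):
    """Top-left (x, y) offsets for the four corner crops and the center crop."""
    return [
        (0, 0),
        (img_w - crop_w, 0),
        (0, img_h - crop_h),
        (img_w - crop_w, img_h - crop_h),
        ((img_w - crop_w) // 2, (img_h - crop_h) // 2),
    ]


def sample_crop(img_w=340, img_h=256, rng=random):
    """Pick a jittered crop size, then one of the five fixed positions."""
    crop_w = rng.choice(CROP_SIZES)  # width and height are drawn independently,
    crop_h = rng.choice(CROP_SIZES)  # which also jitters the aspect ratio
    x, y = rng.choice(corner_crop_offsets(img_w, img_h, crop_w, crop_h))
    return x, y, crop_w, crop_h


offsets = corner_crop_offsets(340, 256, 224, 224)
print(offsets)
print(sample_crop())
```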

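Other tricks

Cross-modality pre-training
- spatial ConvNets
  Initialize the two-stream networks from models pre-trained on ImageNet
- temporal ConvNets
  Cross-modality pre-training: transfer the pre-trained image-domain model to the optical flow domain
Partial BN with dropout
Freeze the mean and variance parameters of every BN layer except the first one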
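For the cross-modality initialization of the temporal ConvNets described above, the paper's idea is to average the first convolution layer's weights over the RGB channels and replicate that average across the flow input channels. A minimal NumPy sketch follows; the function name `rgb_to_flow_conv_weights` and the 64x3x7x7 kernel shape are assumptions for illustration, and 10 channels corresponds to 5 stacked flow fields with x and y components.

```python
import numpy as np


def rgb_to_flow_conv_weights(rgb_weights, flow_channels=10):
    """Average first-layer weights over RGB channels, then replicate them.

    rgb_weights: array of shape (out_channels, 3, kH, kW).
    Returns an array of shape (out_channels, flow_channels, kH, kW).
    """
    mean = rgb_weights.mean(axis=1, keepdims=True)
    return np.repeat(mean, flow_channels, axis=1)


# Random weights standing in for an ImageNet-pretrained first conv layer.
rgb_w = np.random.randn(64, 3, 7, 7).astype(np.float32)
flow_w = rgb_to_flow_conv_weights(rgb_w)
print(flow_w.shape)
```

Only the first layer needs this treatment, because it is the only one whose input channel count depends on the modality.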

Results

PyTorch

from torch import nn

from ops.basic_ops import ConsensusModule, Identity
from transforms import *
# normal/constant are deprecated aliases of the in-place normal_/constant_
from torch.nn.init import normal_ as normal, constant_ as constant

class TSN(nn.Module):
    def __init__(self, num_class, num_segments, modality,
                 base_model='resnet101', new_length=None,
                 consensus_type='avg', before_softmax=True,
                 dropout=0.8,
                 crop_num=1, partial_bn=True):
        super(TSN, self).__init__()
        self.modality = modality
        self.num_segments = num_segments
        self.reshape = True
        self.before_softmax = before_softmax
        self.dropout = dropout
        self.crop_num = crop_num
        self.consensus_type = consensus_type
        if not before_softmax and consensus_type != 'avg':
            raise ValueError("Only avg consensus can be used after Softmax")

        if new_length is None:
            self.new_length = 1 if modality == "RGB" else 5
        else:
            self.new_length = new_length

        print(("""
Initializing TSN with base model: {}.
TSN Configurations:
    input_modality:     {}
    num_segments:       {}
    new_length:         {}
    consensus_module:   {}
    dropout_ratio:      {}
        """.format(base_model, self.modality, self.num_segments, self.new_length, consensus_type, self.dropout)))

        self._prepare_base_model(base_model)

        feature_dim = self._prepare_tsn(num_class)

        if self.modality == 'Flow':
            print("Converting the ImageNet model to a flow init model")
            self.base_model = self._construct_flow_model(self.base_model)
            print("Done. Flow model ready...")
        elif self.modality == 'RGBDiff':
            print("Converting the ImageNet model to RGB+Diff init model")
            self.base_model = self._construct_diff_model(self.base_model)
            print("Done. RGBDiff model ready.")

        self.consensus = ConsensusModule(consensus_type)

        if not self.before_softmax:
            self.softmax = nn.Softmax(dim=1)  # explicit class dimension; implicit dim is deprecated

        self._enable_pbn = partial_bn
        if partial_bn:
            self.partialBN(True)

    def _prepare_tsn(self, num_class):
        feature_dim = getattr(self.base_model, self.base_model.last_layer_name).in_features
        if self.dropout == 0:
            setattr(self.base_model, self.base_model.last_layer_name, nn.Linear(feature_dim, num_class))
            self.new_fc = None
        else:
            setattr(self.base_model, self.base_model.last_layer_name, nn.Dropout(p=self.dropout))
            self.new_fc = nn.Linear(feature_dim, num_class)

        std = 0.001
        if self.new_fc is None:
            normal(getattr(self.base_model, self.base_model.last_layer_name).weight, 0, std)
            constant(getattr(self.base_model, self.base_model.last_layer_name).bias, 0)
        else:
            normal(self.new_fc.weight, 0, std)
            constant(self.new_fc.bias