[Video Understanding] Notes on TSN: Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

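These notes cover Temporal Segment Networks (TSN), a deep action recognition method that uses a sparse sampling strategy to capture long-range temporal dependencies. TSN performs strongly on ActivityNet, HMDB51, and UCF101. Accuracy is improved by fusing modalities such as RGB and optical flow and by applying data augmentation such as random cropping and flipping. The notes also discuss the role of pre-training and partial BN in optimizing the network.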

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

https://arxiv.org/abs/1608.00859
https://github.com/yjxiong/temporal-segment-networks
https://github.com/yjxiong/tsn-pytorch

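Temporal Segment Networks (TSN, two-stream)
TSN combines a sparse temporal sampling strategy with video-level supervision, so it can learn efficiently from the whole action video rather than from a single snippet. Testing follows the standard protocol used by mainstream methods.
Winner of the ActivityNet 2016 challenge (93.2% mAP); HMDB51 (69.4%); UCF101 (94.2%)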

Paper

Method

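Long-range temporal dependencies in video

Handling motion information and finding a way to fuse appearance and motion information are the keys to video understanding tasks.

The two mainstream approaches to action recognition today are 3D convolution and two-stream networks, but both capture only short-range temporal dependencies in a video. To capture long-range dependencies, these methods usually have to sample video clips densely. (In temporal action localization, for example, the video is split into frames and multi-scale sliding windows are applied; with a window of 64, every 64 frames form one clip, so the video is divided into a number of clips.)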
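
To make the dense-sampling cost concrete, here is a minimal sketch of cutting a video into sliding-window clips. The function name and the non-overlapping stride are assumptions made for illustration; the notes only state the window size of 64.

```python
def sliding_window_clips(num_frames, window=64, stride=64):
    """Return (start, end) frame ranges, end exclusive, for every full window.

    The stride is an assumption: the notes only give the window size.
    """
    clips = []
    for start in range(0, num_frames - window + 1, stride):
        clips.append((start, start + window))
    return clips


# A 200-frame video yields three full 64-frame clips; the last 8 frames are dropped.
clips = sliding_window_clips(num_frames=200, window=64, stride=64)
print(clips)
```

Every frame inside every clip has to be processed, which is why the cost grows with video length.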

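TSN instead samples sparsely and uses information from the whole video, since adjacent frames are largely redundant.

TSN divides the video into 3 segments and draws one snippet uniformly at random from each segment. A two-stream network produces class scores for each snippet (the values before softmax); the scores of the different snippets are averaged, and the average is passed through softmax. In the architecture figure, the K spatial ConvNets share parameters, and so do the K temporal ConvNets.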
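
The sampling and consensus steps described above can be sketched in plain Python. This is an illustration under assumptions, not the repository's code: the helper names are hypothetical, the class scores are made-up numbers, and the video is assumed to have at least as many frames as segments.

```python
import math
import random


def sample_snippet_indices(num_frames, num_segments=3, rng=random):
    """Pick one random frame index from each of num_segments equal-length segments."""
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))  # exclusive upper bound
        indices.append(rng.randrange(start, end))
    return indices


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def segmental_consensus(snippet_scores):
    """Average the pre-softmax class scores over snippets, then apply softmax."""
    num_snippets = len(snippet_scores)
    num_classes = len(snippet_scores[0])
    avg = [sum(s[c] for s in snippet_scores) / num_snippets for c in range(num_classes)]
    return softmax(avg)


# One index per segment of a 90-frame video: [0, 30), [30, 60), [60, 90).
indices = sample_snippet_indices(num_frames=90, num_segments=3, rng=random.Random(0))

# Made-up scores for 3 snippets and 2 classes.
probs = segmental_consensus([[2.0, 0.5], [1.0, 1.5], [3.0, 0.0]])
print(indices, probs)
```

Because only one snippet per segment goes through the network, the cost per video stays fixed regardless of its length.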

Details

Input
  • RGB
    a single frame from the video
  • RGB difference
    the difference between two adjacent frames, which can represent motion information
  • optical flow
  • warped optical flow
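
As a rough illustration of the RGB difference input listed above, the sketch below subtracts adjacent frames of a stacked tensor. The `(T, C, H, W)` layout and the function name are assumptions made for this example.

```python
import torch


def rgb_difference(frames):
    """frames: tensor of shape (T, C, H, W).

    Returns a (T - 1, C, H, W) tensor whose t-th entry is frame[t + 1] - frame[t].
    """
    return frames[1:] - frames[:-1]


# Four tiny synthetic "frames": each frame is the previous one plus 12 everywhere.
frames = torch.arange(4 * 3 * 2 * 2, dtype=torch.float32).reshape(4, 3, 2, 2)
diff = rgb_difference(frames)
print(diff.shape)
```
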
Modality (score fusion weights)
  • RGB/optical flow
    1:1.5
  • RGB/optical flow/warped optical flow
    1:1:0.5
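
The ratios above can be read as weights in a weighted average of per-modality class scores. The sketch below assumes exactly that; the function name, the normalization by the weight sum, and the example scores are illustrative choices, not taken from the source.

```python
def fuse_scores(modality_scores, weights):
    """Weighted average of per-class scores across modalities.

    modality_scores: dict of modality name -> list of class scores
    weights:         dict of modality name -> fusion weight
    """
    if modality_scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same modalities")
    total_weight = sum(weights.values())
    num_classes = len(next(iter(modality_scores.values())))
    fused = [0.0] * num_classes
    for name, scores in modality_scores.items():
        for c, s in enumerate(scores):
            fused[c] += weights[name] * s / total_weight
    return fused


# Made-up two-class scores, fused with the RGB : optical flow = 1 : 1.5 ratio.
scores = {"rgb": [0.6, 0.4], "flow": [0.2, 0.8]}
fused = fuse_scores(scores, {"rgb": 1.0, "flow": 1.5})
pred = max(range(len(fused)), key=fused.__getitem__)
print(fused, pred)
```

Dividing by the weight sum does not change the predicted class; it only keeps the fused scores on the same scale as the inputs.
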
Data augmentation
  • Random cropping
  • Horizontal flipping
  • Corner cropping
    four corners plus the center, so the network does not attend only to the center region
  • Scale and ratio jittering
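
A minimal sketch of how corner cropping, scale and ratio jittering, and horizontal flipping could be combined when picking a crop. The function names, the 340x256 frame size, and the candidate crop sizes are illustrative assumptions; the resize of the crop to the network input size is left out.

```python
import random


def corner_crop_offsets(img_w, img_h, crop_w, crop_h):
    """Top-left (x, y) offsets of the four corner crops plus the center crop."""
    max_x, max_y = img_w - crop_w, img_h - crop_h
    return [
        (0, 0),                    # top-left
        (max_x, 0),                # top-right
        (0, max_y),                # bottom-left
        (max_x, max_y),            # bottom-right
        (max_x // 2, max_y // 2),  # center
    ]


def sample_augmentation(img_w, img_h, sizes=(256, 224, 192, 168), rng=random):
    """Choose a jittered crop box at a corner or the center, and a flip decision.

    Width and height are drawn independently, which jitters both scale and
    aspect ratio. The candidate sizes are assumed values for this sketch.
    """
    crop_w = rng.choice(sizes)
    crop_h = rng.choice(sizes)
    x, y = rng.choice(corner_crop_offsets(img_w, img_h, crop_w, crop_h))
    flip = rng.random() < 0.5
    return {"box": (x, y, x + crop_w, y + crop_h), "flip": flip}


offsets = corner_crop_offsets(340, 256, 224, 224)
aug = sample_augmentation(340, 256, rng=random.Random(0))
print(offsets, aug)
```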

Other techniques

  • Cross-modality pre-training
    • spatial ConvNets
      initialize the two-stream network from models pre-trained on ImageNet
    • temporal ConvNets
      cross-modality pre-training: transfer the pre-trained image-domain model to the optical flow domain
  • Partial BN with dropout
    the mean and variance statistics of every BN layer except the first are frozen
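
Both techniques can be sketched on a toy network. This is an illustration under assumptions, not the repository's implementation: the helper names are hypothetical, the toy layers stand in for a real backbone, and the first-layer conversion follows the common recipe of averaging the RGB kernels and repeating them for each flow input channel.

```python
import torch
from torch import nn


def rgb_conv_to_flow_conv(rgb_conv, flow_channels):
    """Build a first conv layer for stacked-flow input from an RGB conv layer."""
    # Average the kernels over the RGB input channels, then repeat per flow channel.
    mean_kernel = rgb_conv.weight.data.mean(dim=1, keepdim=True)
    flow_conv = nn.Conv2d(flow_channels, rgb_conv.out_channels,
                          kernel_size=rgb_conv.kernel_size,
                          stride=rgb_conv.stride,
                          padding=rgb_conv.padding,
                          bias=rgb_conv.bias is not None)
    flow_conv.weight.data = mean_kernel.repeat(1, flow_channels, 1, 1).contiguous()
    if rgb_conv.bias is not None:
        flow_conv.bias.data = rgb_conv.bias.data.clone()
    return flow_conv


def freeze_bn_except_first(model):
    """Put every BatchNorm2d except the first into eval mode and freeze its affine params."""
    seen_first = False
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            if seen_first:
                m.eval()  # running mean/variance are no longer updated
                m.weight.requires_grad = False
                m.bias.requires_grad = False
            seen_first = True


# Toy stand-in for a backbone: two conv + BN blocks.
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
)
net[0] = rgb_conv_to_flow_conv(net[0], flow_channels=10)  # e.g. 5 flow frames x 2 directions
net.train()
freeze_bn_except_first(net)
out = net(torch.randn(2, 10, 16, 16))
print(out.shape)
```

Note that calling `net.train()` again would switch the frozen BN layers back to training mode, so the freeze has to be reapplied after every such call.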

Results

PyTorch

from torch import nn

from ops.basic_ops import ConsensusModule, Identity
from transforms import *
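# normal/constant are deprecated in favor of the in-place normal_/constant_; alias them so the calls below keep working
from torch.nn.init import normal_ as normal, constant_ as constant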

class TSN(nn.Module):
    def __init__(self, num_class, num_segments, modality,
                 base_model='resnet101', new_length=None,
                 consensus_type='avg', before_softmax=True,
                 dropout=0.8,
                 crop_num=1, partial_bn=True):
        super(TSN, self).__init__()
        self.modality = modality
        self.num_segments = num_segments
        self.reshape = True
        self.before_softmax = before_softmax
        self.dropout = dropout
        self.crop_num = crop_num
        self.consensus_type = consensus_type
        if not before_softmax and consensus_type != 'avg':
            raise ValueError("Only avg consensus can be used after Softmax")

        if new_length is None:
            self.new_length = 1 if modality == "RGB" else 5
        else:
            self.new_length = new_length

        print(("""
Initializing TSN with base model: {}.
TSN Configurations:
    input_modality:     {}
    num_segments:       {}
    new_length:         {}
    consensus_module:   {}
    dropout_ratio:      {}
        """.format(base_model, self.modality, self.num_segments, self.new_length, consensus_type, self.dropout)))

        self._prepare_base_model(base_model)

        feature_dim = self._prepare_tsn(num_class)

        if self.modality == 'Flow':
            print("Converting the ImageNet model to a flow init model")
            self.base_model = self._construct_flow_model(self.base_model)
            print("Done. Flow model ready...")
        elif self.modality == 'RGBDiff':
            print("Converting the ImageNet model to RGB+Diff init model")
            self.base_model = self._construct_diff_model(self.base_model)
            print("Done. RGBDiff model ready.")

        self.consensus = ConsensusModule(consensus_type)

        if not self.before_softmax:
            self.softmax = nn.Softmax(dim=1)  # explicit class dimension; assumes 2-D (batch, class) scores

        self._enable_pbn = partial_bn
        if partial_bn:
            self.partialBN(True)

    def _prepare_tsn(self, num_class):
        feature_dim = getattr(self.base_model, self.base_model.last_layer_name).in_features
        if self.dropout == 0:
            setattr(self.base_model, self.base_model.last_layer_name, nn.Linear(feature_dim, num_class))
            self.new_fc = None
        else:
            setattr(self.base_model, self.base_model.last_layer_name, nn.Dropout(p=self.dropout))
            self.new_fc = nn.Linear(feature_dim, num_class)

        std = 0.001
        if self.new_fc is None:
            normal(getattr(self.base_model, self.base_model.last_layer_name).weight, 0, std)
            constant(getattr(self.base_model, self.base_model.last_layer_name).bias, 0)
        else:
            normal(self.new_fc.weight, 0, std)
            constant(self.new_fc.bias, 0)
        return feature_dim  # __init__ assigns the result of _prepare_tsn to feature_dim