Paper Title
Long-term Leap Attention, Short-term Periodic Shift for Video Classification
Paper Authors
Paper Abstract
A video transformer naturally incurs a heavier computation burden than a static vision transformer, as the former processes a sequence $T$ times longer than the latter under the current attention of quadratic complexity $(T^2N^2)$. Existing works treat the temporal axis as a simple extension of the spatial axes, focusing on shortening the spatio-temporal sequence by either generic pooling or local windowing, without exploiting temporal redundancy. However, videos naturally contain redundant information between neighboring frames; thereby, we could potentially suppress attention on visually similar frames in a dilated manner. Based on this hypothesis, we propose LAPS, a long-term ``\textbf{\textit{Leap Attention}}'' (LA), short-term ``\textbf{\textit{Periodic Shift}}'' (\textit{P}-Shift) module for video transformers, with $(2TN^2)$ complexity. Specifically, the ``LA'' groups long-term frames into pairs, then refactors each discrete pair via attention. The ``\textit{P}-Shift'' exchanges features between temporal neighbors to compensate for the loss of short-term dynamics. By replacing vanilla 2D attention with LAPS, we can adapt a static transformer into a video one with zero extra parameters and negligible computation overhead ($\sim$2.6\%). Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS transformer achieves competitive performance in terms of accuracy, FLOPs, and Params among CNN and transformer SOTAs. We open-source our project at \sloppy \href{https://github.com/VideoNetworks/LAPS-transformer}{\textit{\color{magenta}{https://github.com/VideoNetworks/LAPS-transformer}}}.
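
To make the two operations concrete, the following is a minimal PyTorch sketch of the idea described in the abstract: frames that are $T/2$ apart are paired so that each token attends over only $2N$ tokens (roughly matching the stated $(2TN^2)$ cost), and a small fraction of channels is shifted to temporal neighbors for the periodic shift. The tensor layout, the $(t, t+T/2)$ pairing rule, and the shift ratio are illustrative assumptions, not the authors' implementation, which is available in the linked repository.

# A minimal PyTorch sketch, not the authors' implementation. Shapes, the
# (t, t + T/2) pairing rule, and the shift ratio are illustrative assumptions.
import torch


def periodic_shift(x, shift_ratio=0.25):
    # Short-term Periodic Shift: swap a small fraction of channels with
    # temporal neighbors.  x: (B, T, N, C) = batch, frames, tokens, channels.
    B, T, N, C = x.shape
    c = int(C * shift_ratio) // 2
    out = x.clone()
    out[:, 1:, :, :c] = x[:, :-1, :, :c]              # channels from the previous frame
    out[:, :-1, :, c:2 * c] = x[:, 1:, :, c:2 * c]    # channels from the next frame
    return out


def leap_attention(q, k, v):
    # Long-term Leap Attention: pair distant frames (t, t + T/2) and attend
    # within each pair.  Every token sees 2N keys instead of TN, so the cost
    # is roughly (T/2) * (2N)^2 = 2TN^2, matching the stated complexity.
    B, T, N, C = q.shape
    assert T % 2 == 0, "assumes an even number of frames"

    def pair(x):
        x = x.view(B, 2, T // 2, N, C)         # split frames into two halves
        x = x.permute(0, 2, 1, 3, 4)           # (B, T/2, 2, N, C): group (t, t + T/2)
        return x.reshape(B, T // 2, 2 * N, C)  # 2N tokens per attention group

    qp, kp, vp = pair(q), pair(k), pair(v)
    attn = torch.softmax(qp @ kp.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = attn @ vp                            # (B, T/2, 2N, C)
    out = out.view(B, T // 2, 2, N, C).permute(0, 2, 1, 3, 4).reshape(B, T, N, C)
    return out


if __name__ == "__main__":
    x = torch.randn(2, 8, 196, 64)             # e.g. 8 frames of 14x14 patch tokens
    x = periodic_shift(x)
    y = leap_attention(x, x, x)
    print(y.shape)                             # torch.Size([2, 8, 196, 64])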