Paper Title

Future Transformer for Long-term Action Anticipation

Paper Authors

Dayoung Gong, Joonseok Lee, Manjin Kim, Seong Jong Ha, Minsu Cho

Paper Abstract

The task of predicting future actions from a video is crucial for a real-world agent interacting with others. When anticipating actions in the distant future, we humans typically consider long-term relations over the whole sequence of actions, i.e., not only observed actions in the past but also potential actions in the future. In a similar spirit, we propose an end-to-end attention model for action anticipation, dubbed Future Transformer (FUTR), that leverages global attention over all input frames and output tokens to predict a minutes-long sequence of future actions. Unlike the previous autoregressive models, the proposed method learns to predict the whole sequence of future actions in parallel decoding, enabling more accurate and fast inference for long-term anticipation. We evaluate our method on two standard benchmarks for long-term action anticipation, Breakfast and 50 Salads, achieving state-of-the-art results.
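The abstract contrasts parallel decoding with autoregressive decoding: a fixed set of query tokens attends to all encoded input frames at once, so the entire future action sequence comes out in a single forward pass. The following is a minimal numpy sketch of that idea only; the shapes, the single attention layer, and the classifier weights are hypothetical stand-ins, not the authors' FUTR implementation.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: each query attends globally over all keys.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
T, N, d, n_actions = 8, 4, 16, 10   # observed frames, future tokens, dim, action classes

frames = rng.standard_normal((T, d))    # encoded features of the observed frames
queries = rng.standard_normal((N, d))   # learned queries, one per future action slot
W_cls = rng.standard_normal((d, n_actions))  # hypothetical classification head

# Parallel decoding: all N queries attend to ALL frames simultaneously,
# yielding the whole future action sequence in one pass (no feedback loop).
decoded = attention(queries, frames, frames)   # (N, d)
logits = decoded @ W_cls                       # (N, n_actions)
future_actions = logits.argmax(axis=-1)        # N predicted action labels at once
print(future_actions.shape)  # (N,)
```

An autoregressive decoder would instead run N sequential steps, feeding each predicted action back as input to the next step, which is why the parallel scheme gives faster inference for minutes-long horizons.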
