Paper Title
Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos
Paper Authors
Abstract
In recent years, transformer architectures have grown in popularity. Modulated Detection Transformer (MDETR) is an end-to-end multi-modal understanding model that performs tasks such as phrase grounding, referring expression comprehension, referring expression segmentation, and visual question answering. One remarkable aspect of the model is its ability to make inferences about classes it was not previously trained on. In this work, we explore the use of MDETR on a new task, action detection, without any prior training on that task. We obtain quantitative results using the Atomic Visual Actions dataset. Although the model does not achieve the best performance on the task, we believe this is an interesting finding. We show that it is possible to use a multi-modal model to tackle a task it was not designed for. Finally, we believe that this line of research may lead to the generalization of MDETR to additional downstream tasks.