Paper Title


ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities

Paper Authors

Terry Yue Zhuo, Yaqing Liao, Yuecheng Lei, Lizhen Qu, Gerard de Melo, Xiaojun Chang, Yazhou Ren, Zenglin Xu

Paper Abstract


We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task where embodied AI agents can reason about and forecast future actions of humans based on video clips of their initial activities and their intents expressed in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multiple-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the others are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities.
