Paper Title
VIMA: General Robot Manipulation with Multimodal Prompts
Paper Authors
Paper Abstract
Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the best competing variant. Code and video demos are available at https://vimalabs.github.io/
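To make the idea of a multimodal prompt concrete, the sketch below is a minimal, hypothetical Python/PyTorch example, not the released VIMA implementation: it interleaves text-token and image-crop embeddings into a single prompt sequence and conditions a causal transformer decoder on that sequence to predict the next motor action autoregressively. The class name, dimensions, and the simple linear patch projection are all illustrative assumptions.

# Minimal, hypothetical sketch (not the released VIMA code): interleave
# text-token and image-crop embeddings into one prompt sequence, then
# predict motor actions autoregressively with a causal transformer decoder.
import torch
import torch.nn as nn

D_MODEL = 256        # assumed embedding width
VOCAB_SIZE = 1000    # assumed text-token vocabulary size
ACTION_DIM = 7       # assumed continuous action dimensionality

class MultimodalPromptPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        # A plain linear projection of 32x32 RGB crops stands in for a real visual tokenizer.
        self.image_embed = nn.Linear(3 * 32 * 32, D_MODEL)
        layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.action_in = nn.Linear(ACTION_DIM, D_MODEL)
        self.action_out = nn.Linear(D_MODEL, ACTION_DIM)

    def encode_prompt(self, segments):
        """Concatenate text and image segments in their original order."""
        tokens = []
        for kind, value in segments:
            if kind == "text":
                tokens.append(self.text_embed(value))               # (T, D)
            else:  # "image": (N, 3, 32, 32) crops -> N visual tokens
                tokens.append(self.image_embed(value.flatten(1)))   # (N, D)
        return torch.cat(tokens, dim=0).unsqueeze(0)                # (1, L, D)

    def forward(self, segments, past_actions):
        prompt = self.encode_prompt(segments)                       # cross-attention memory
        tgt = self.action_in(past_actions).unsqueeze(0)             # (1, S, D)
        # Causal mask so each action attends only to earlier actions.
        size = tgt.size(1)
        mask = torch.triu(torch.full((size, size), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, prompt, tgt_mask=mask)
        return self.action_out(hidden)[:, -1]                       # next action

# Usage: a prompt like "put <image A> into <image B>" becomes interleaved segments.
policy = MultimodalPromptPolicy()
segments = [
    ("text", torch.randint(0, VOCAB_SIZE, (3,))),   # tokens for "put"
    ("image", torch.randn(1, 3, 32, 32)),           # object crop A
    ("text", torch.randint(0, VOCAB_SIZE, (2,))),   # tokens for "into"
    ("image", torch.randn(1, 3, 32, 32)),           # goal/scene crop B
]
past_actions = torch.zeros(1, ACTION_DIM)           # rollout start token
next_action = policy(segments, past_actions)        # shape: (1, ACTION_DIM)

Under this framing, a demonstration, a language instruction, or a visual goal all reduce to one interleaved token sequence, which is the unification the abstract describes.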