Paper Title
VIMA: General Robot Manipulation with Multimodal Prompts
Paper Authors
Paper Abstract
Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the best competing variant. Code and video demos are available at https://vimalabs.github.io/
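To make the idea of a multimodal prompt concrete, the sketch below is a minimal, hypothetical Python/PyTorch example, not the released VIMA implementation: it interleaves text-token and image-crop embeddings into a single prompt sequence and conditions a causal transformer decoder on that sequence to predict the next motor action autoregressively. The class name, dimensions, and the simple linear patch projection are all illustrative assumptions.

# Minimal, hypothetical sketch (not the released VIMA code): interleave
# text-token and image-crop embeddings into one prompt sequence, then
# predict motor actions autoregressively with a causal transformer decoder.
import torch
import torch.nn as nn

D_MODEL = 256        # assumed embedding width
VOCAB_SIZE = 1000    # assumed text-token vocabulary size
ACTION_DIM = 7       # assumed continuous action dimensionality

class MultimodalPromptPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        # A plain linear projection of 32x32 RGB crops stands in for a real visual tokenizer.
        self.image_embed = nn.Linear(3 * 32 * 32, D_MODEL)
        layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.action_in = nn.Linear(ACTION_DIM, D_MODEL)
        self.action_out = nn.Linear(D_MODEL, ACTION_DIM)

    def encode_prompt(self, segments):
        """Concatenate text and image segments in their original order."""
        tokens = []
        for kind, value in segments:
            if kind == "text":
                tokens.append(self.text_embed(value))               # (T, D)
            else:  # "image": (N, 3, 32, 32) crops -> N visual tokens
                tokens.append(self.image_embed(value.flatten(1)))   # (N, D)
        return torch.cat(tokens, dim=0).unsqueeze(0)                # (1, L, D)

    def forward(self, segments, past_actions):
        prompt = self.encode_prompt(segments)                       # cross-attention memory
        tgt = self.action_in(past_actions).unsqueeze(0)             # (1, S, D)
        # Causal mask so each action attends only to earlier actions.
        size = tgt.size(1)
        mask = torch.triu(torch.full((size, size), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, prompt, tgt_mask=mask)
        return self.action_out(hidden)[:, -1]                       # next action

# Usage: a prompt like "put <image A> into <image B>" becomes interleaved segments.
policy = MultimodalPromptPolicy()
segments = [
    ("text", torch.randint(0, VOCAB_SIZE, (3,))),   # tokens for "put"
    ("image", torch.randn(1, 3, 32, 32)),           # object crop A
    ("text", torch.randint(0, VOCAB_SIZE, (2,))),   # tokens for "into"
    ("image", torch.randn(1, 3, 32, 32)),           # goal/scene crop B
]
past_actions = torch.zeros(1, ACTION_DIM)           # rollout start token
next_action = policy(segments, past_actions)        # shape: (1, ACTION_DIM)

Under this framing, a demonstration, a language instruction, or a visual goal all reduce to one interleaved token sequence, which is the unification the abstract describes.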