Paper Title
Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
Paper Authors
Paper Abstract
Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.
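The pipeline in the abstract reduces to three steps: (1) train a non-causal inverse dynamics model (IDM) on a small labeled dataset, (2) use the IDM to pseudo-label a large corpus of unlabeled gameplay video, and (3) behavior-clone a causal policy (the behavioral prior) on those pseudo-labels. Below is a minimal sketch of that loop, assuming toy MLP models, made-up dimensions (`OBS_DIM`, `N_ACTIONS`, `K`), and random stand-in tensors; none of this is the paper's actual architecture or data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, N_ACTIONS, K = 32, 16, 4  # toy sizes (assumed); K = frames of context

class InverseDynamicsModel(nn.Module):
    """Non-causal: sees frames before AND after t to infer the action at t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear((2 * K + 1) * OBS_DIM, 256), nn.ReLU(),
            nn.Linear(256, N_ACTIONS))

    def forward(self, frames):  # frames: (B, 2K+1, OBS_DIM)
        return self.net(frames.flatten(1))  # action logits

class BehavioralPrior(nn.Module):
    """Causal: sees only frames up to t, as the deployed agent must."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear((K + 1) * OBS_DIM, 256), nn.ReLU(),
            nn.Linear(256, N_ACTIONS))

    def forward(self, frames):  # frames: (B, K+1, OBS_DIM)
        return self.net(frames.flatten(1))

def train(model, windows, actions, steps=200):
    """Plain supervised cross-entropy training (full-batch, for brevity)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = F.cross_entropy(model(windows), actions)
        opt.zero_grad(); loss.backward(); opt.step()

# Step 1: train the IDM on a small labeled dataset (random stand-in here).
labeled_windows = torch.randn(512, 2 * K + 1, OBS_DIM)
labeled_actions = torch.randint(0, N_ACTIONS, (512,))
idm = InverseDynamicsModel()
train(idm, labeled_windows, labeled_actions)

# Step 2: pseudo-label a much larger unlabeled video corpus with the IDM.
unlabeled_windows = torch.randn(8192, 2 * K + 1, OBS_DIM)
with torch.no_grad():
    pseudo_actions = idm(unlabeled_windows).argmax(dim=-1)

# Step 3: behavior-clone a causal policy (the behavioral prior) on the
# pseudo-labels, feeding it only the past half of each window (frames 0..K).
prior = BehavioralPrior()
train(prior, unlabeled_windows[:, : K + 1], pseudo_actions)
```

The asymmetry between the two models is the key design choice the abstract alludes to: the IDM may look at future frames, which makes inferring the action at each step far easier than causal prediction, so a small labeled set suffices to train it accurately enough to label the rest of the corpus.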