Paper Title
Understanding Contexts Inside Robot and Human Manipulation Tasks through a Vision-Language Model and Ontology System in a Video Stream
Paper Authors
Paper Abstract
Manipulation tasks in daily life, such as pouring water, unfold intentionally under specialized manipulation contexts. Being able to process contextual knowledge in these Activities of Daily Living (ADLs) over time can help us understand manipulation intentions, which are essential for an intelligent robot to transition smoothly between various manipulation actions. In this paper, to model the intended concepts of manipulation, we present a vision dataset under a strictly constrained knowledge domain for both robot and human manipulations, where manipulation concepts and relations are stored by an ontology system in a taxonomic manner. Furthermore, we propose a scheme to generate a combination of visual attentions and an evolving knowledge graph filled with commonsense knowledge. Our scheme works with real-world camera streams and fuses an attention-based Vision-Language model with the ontology system. The experimental results demonstrate that the proposed scheme can successfully represent the evolution of an intended object manipulation procedure for both robots and humans. The proposed scheme allows the robot to mimic human-like intentional behaviors by watching real-time videos. We aim to develop this scheme further for real-world robot intelligence in Human-Robot Interaction.
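As a rough illustration of the taxonomic ontology storage and evolving knowledge graph described in the abstract, the sketch below assumes an RDF-based representation built with rdflib; the namespace, class names, relations, and the `observe` helper are hypothetical placeholders and are not taken from the paper.

```python
# Minimal sketch (not the authors' implementation) of storing manipulation
# concepts taxonomically and evolving a knowledge graph as new observations
# arrive; all identifiers below are illustrative.
from rdflib import Graph, Namespace, RDF, RDFS

MANIP = Namespace("http://example.org/manipulation#")  # hypothetical namespace

g = Graph()
g.bind("manip", MANIP)

# Taxonomic (is-a) backbone of the ontology.
g.add((MANIP.Pour, RDFS.subClassOf, MANIP.ManipulationAction))
g.add((MANIP.Cup, RDFS.subClassOf, MANIP.Container))
g.add((MANIP.WaterBottle, RDFS.subClassOf, MANIP.Container))

def observe(graph, subject, relation, obj):
    """Add one grounded triple, e.g. a relation proposed from a video frame."""
    graph.add((subject, relation, obj))

# Evolve the graph as an intended manipulation procedure unfolds.
observe(g, MANIP.pour_event_1, RDF.type, MANIP.Pour)
observe(g, MANIP.pour_event_1, MANIP.source, MANIP.WaterBottle)
observe(g, MANIP.pour_event_1, MANIP.target, MANIP.Cup)

# Inspect the current state of the knowledge graph.
for s, p, o in g.triples((MANIP.pour_event_1, None, None)):
    print(s, p, o)
```

In such a setup, a vision-language model would supply the grounded triples frame by frame, while the taxonomic layer contributes the commonsense structure against which those observations are interpreted.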