Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

Paper Title

Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

Authors

Minsu Kim, Jeong Hun Yeo, Yong Man Ro

Abstract

Recognizing speech from silent lip movement, which is called lip reading, is a challenging task due to 1) the inherent insufficiency of the information in lip movement to fully represent speech, and 2) the existence of homophenes, which have similar lip movements but different pronunciations. In this paper, we try to alleviate these two challenges by proposing a Multi-head Visual-audio Memory (MVM). Firstly, MVM is trained with audio-visual datasets and remembers audio representations by modelling the inter-relationships of paired audio-visual representations. At the inference stage, visual input alone can extract the saved audio representation from the memory by examining the learned inter-relationships. Therefore, the lip reading model can complement the insufficient visual information with the extracted audio representations. Secondly, MVM is composed of multi-head key memories for saving visual features and one value memory for saving audio knowledge, a design intended to distinguish homophenes. With the multi-head key memories, MVM extracts possible candidate audio features from the memory, which allows the lip reading model to consider which pronunciations the input lip movement could represent. This can also be viewed as an explicit implementation of the one-to-many viseme-to-phoneme mapping. Moreover, MVM is employed at multiple temporal levels to take context into account when retrieving from the memory and distinguishing homophenes. Extensive experimental results verify the effectiveness of the proposed method in lip reading and in distinguishing homophenes.
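The retrieval step described in the abstract, where a visual feature addresses several key memories and each head reads a candidate audio feature from one shared value memory, can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity addressing, the softmax with a `scale` factor, the memory sizes, and every name below (`retrieve_candidates`, `key_memories`, `value_memory`) are assumptions made for illustration, and training of the memories is omitted entirely.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_candidates(visual_feature, key_memories, value_memory, scale=10.0):
    """Return one candidate audio feature per key-memory head.

    key_memories: list of heads; each head is a list of key slots
                  (vectors in the visual feature space).
    value_memory: a single list of value slots (vectors in the audio
                  feature space) shared by all heads.
    Assumption: addressing is softmax over scaled cosine similarity.
    """
    audio_dim = len(value_memory[0])
    candidates = []
    for keys in key_memories:
        # Each head produces its own addressing weights over the slots.
        weights = softmax([scale * cosine(visual_feature, k) for k in keys])
        # Read the shared value memory as a weighted sum of its slots.
        audio = [
            sum(w * slot[d] for w, slot in zip(weights, value_memory))
            for d in range(audio_dim)
        ]
        candidates.append(audio)
    return candidates


# Toy usage with random, untrained memories (hypothetical sizes).
random.seed(0)
num_heads, num_slots, visual_dim, audio_dim = 3, 4, 5, 6
key_memories = [
    [[random.gauss(0, 1) for _ in range(visual_dim)] for _ in range(num_slots)]
    for _ in range(num_heads)
]
value_memory = [
    [random.gauss(0, 1) for _ in range(audio_dim)] for _ in range(num_slots)
]
visual_feature = [random.gauss(0, 1) for _ in range(visual_dim)]

candidates = retrieve_candidates(visual_feature, key_memories, value_memory)
print(len(candidates), "candidate audio features of size", len(candidates[0]))
```

Because each head has its own keys, the same lip-movement feature can address different slots of the value memory, which is one way to picture the one-to-many viseme-to-phoneme mapping mentioned in the abstract.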
