论文标题
歌手:嘴唇和声音的视听同步模型
VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices
论文作者
论文摘要
在本文中,我们解决了包含人脸和声音的视频中的唇彩同步问题。我们的方法基于确定视频中的嘴唇运动和声音是否同步,具体取决于其视听对应得分。我们提出了一个基于视听的跨模式变压器模型,该模型在标准的唇读语音基准数据集LRS2上胜过视频 - 视觉同步任务中的几个基线模型。尽管现有的方法主要集中于语音视频中的唇部同步,但我们也考虑了歌声的特殊情况。由于持续的元音声音,唱歌声音是同步的更具挑战性的用例。我们还研究了在唱歌语音的背景下在语音数据集中训练的LIP同步模型的相关性。最后,我们使用在唱歌语音分离任务中通过唇部同步模型学到的冷冻视觉特征,以优于训练有素的端到端的基线音频视觉模型。演示,源代码和预培训模型可在https://ipcv.github.io/vocalist/上找到
In this paper, we address the problem of lip-voice synchronisation in videos containing human face and voice. Our approach is based on determining if the lips motion and the voice in a video are synchronised or not, depending on their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models in the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While the existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of the singing voice. The singing voice is a more challenging use case for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model which was trained end-to-end. The demos, source code, and the pre-trained models are available on https://ipcv.github.io/VocaLiST/