Paper Title

Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging

Authors

Yeqi Bai, Tao Ma, Lipo Wang, Zhenjie Zhang

Abstract

While deep learning technologies are now capable of generating realistic images that confuse humans, research efforts are turning to the synthesis of images for more concrete and application-specific purposes. Facial image generation based on vocal characteristics from speech is one such important yet challenging task. It is the key enabler of influential use cases of image generation, especially for businesses in public security and entertainment. Existing solutions to the Speech2Face problem render limited image quality and fail to preserve facial similarity, due to the lack of quality datasets for training and of appropriate integration of vocal features. In this paper, we investigate these key technical challenges and propose Speech Fusion to Face, or SF2F in short, attempting to address the issue of facial image quality and the poor connection between the vocal feature domain and modern image generation models. By adopting new strategies for the data model and training, we demonstrate a dramatic performance boost over the state-of-the-art solution, doubling the recall of individual identity and lifting the quality score from 15 to 19 based on the mutual information score with the VGGFace classifier.
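The "recall of individual identity" metric mentioned in the abstract can be illustrated with a minimal sketch: embed each generated face and each reference face, then check whether a generated face's nearest reference embedding belongs to the correct identity. This is only a hedged illustration of the general idea; the function name, cosine-similarity choice, and toy embeddings below are assumptions, not the paper's actual evaluation pipeline (which, per the abstract, relies on a VGGFace classifier).

```python
import numpy as np

def identity_recall(gen_emb, ref_emb, k=1):
    """Illustrative top-k identity recall (not the paper's exact metric).

    Row i of gen_emb is assumed to be the generated face for the same
    identity as row i of ref_emb. A generated face counts as a hit when
    its true identity appears among its k most cosine-similar references.
    """
    gen = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sims = gen @ ref.T                       # pairwise cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of k nearest references
    hits = (topk == np.arange(len(gen))[:, None]).any(axis=1)
    return hits.mean()

# Toy example: 3 identities with 4-dimensional embeddings.
ref = np.array([[1., 0., 0., 0.],
                [0., 1., 0., 0.],
                [0., 0., 1., 0.]])
gen = np.array([[0.9, 0.1, 0., 0.],   # matches identity 0
                [0.2, 0.8, 0., 0.],   # matches identity 1
                [1.0, 0.0, 0., 0.]])  # confused with identity 0 -> miss
print(identity_recall(gen, ref))  # prints 0.6666666666666666
```

Doubling such a recall score, as the abstract claims, means twice as many generated faces are matched back to the correct speaker's identity by the face-recognition embedding.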
