Paper Title

Everybody's Talkin': Let Me Talk as You Want

Paper Authors

Linsen Song, Wayne Wu, Chen Qian, Ran He, Chen Change Loy

Paper Abstract

We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic. It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output. Instead of learning a highly heterogeneous and nonlinear mapping from audio to the video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, therefore preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.
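To make the audio-to-expression stage of the pipeline concrete, here is a minimal sketch of a recurrent network that maps per-frame audio features to 3DMM-style expression parameters, which are then recombined with the target's reconstructed geometry and pose before rendering. The module names, feature dimensions (e.g., 28-dim audio features, 64-dim expression coefficients), and the PyTorch framing are all illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Recurrent network mapping per-frame audio features to expression
    parameters (e.g., 3DMM blendshape coefficients). Dimensions are
    illustrative assumptions, not the paper's actual configuration."""
    def __init__(self, audio_dim=28, hidden_dim=256, expr_dim=64):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, expr_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, audio_dim), one feature vector per frame
        h, _ = self.rnn(audio_feats)
        return self.head(h)  # (batch, time, expr_dim)

# Usage sketch: drive the target's expressions from source audio while
# keeping geometry and pose from monocular 3D reconstruction of the footage.
model = AudioToExpression()
audio_feats = torch.randn(1, 100, 28)   # e.g., 100 frames of audio features
expr = model(audio_feats)               # audio-translated expression params
# For each frame t, a renderer would consume the recombined parameters:
# frame_params = {"geometry": geo, "pose": pose[t], "expression": expr[0, t]}
```

Keeping expression orthogonal to geometry and pose is what lets the same network drive arbitrary target videos: only the mouth-related expression coefficients change with the audio, while the identity and head motion of the original footage are untouched.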
