Paper Title

Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video

Paper Authors

Ben Saunders, Necati Cihan Camgoz, Richard Bowden

Abstract

To be truly understandable and accepted by Deaf communities, an automatic Sign Language Production (SLP) system must generate a photo-realistic signer. Prior approaches based on graphical avatars have proven unpopular, whereas recent neural SLP works that produce skeleton pose sequences have been shown to be not understandable to Deaf viewers. In this paper, we propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language. We employ a transformer architecture with a Mixture Density Network (MDN) formulation to handle the translation from spoken language to skeletal pose. A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence. This allows the photo-realistic production of sign videos directly translated from written text. We further propose a novel keypoint-based loss function, which significantly improves the quality of synthesized hand images, operating in the keypoint space to avoid issues caused by motion blur. In addition, we introduce a method for controllable video generation, enabling training on large, diverse sign language datasets and providing the ability to control the signer appearance at inference. Using a dataset of eight different sign language interpreters extracted from broadcast footage, we show that SignGAN significantly outperforms all baseline methods for quantitative metrics and human perceptual studies.
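The keypoint-based loss mentioned in the abstract compares hands in keypoint space rather than pixel space, so blurred ground-truth frames do not dominate the error. A minimal sketch of that idea follows; `keypoint_extractor` is a hypothetical stand-in for the frozen hand-keypoint network, and this is an illustration of the principle, not the authors' implementation:

```python
import numpy as np

def keypoint_loss(keypoint_extractor, generated_hand, real_hand):
    """L1 loss computed in keypoint space rather than pixel space.

    `keypoint_extractor` (hypothetical interface) maps a hand crop to an
    (N, 2) array of 2D joint coordinates. Comparing extracted keypoints
    instead of raw pixels sidesteps errors caused by motion blur in the
    ground-truth frames.
    """
    kp_gen = keypoint_extractor(generated_hand)
    kp_real = keypoint_extractor(real_hand)
    # Mean absolute difference over all joints and coordinates
    return float(np.mean(np.abs(kp_gen - kp_real)))
```

In training, the extractor's weights would stay frozen so the synthesis model is optimized to produce hands whose detected joints match the ground truth.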
