Paper Title
AnyFace: Free-style Text-to-Face Synthesis and Manipulation
Paper Authors
Paper Abstract
Existing text-to-image synthesis methods are generally applicable only to words that appear in the training dataset. However, human faces are too variable to be described with a limited vocabulary. Therefore, this paper proposes the first free-style text-to-face method, named AnyFace, which enables much wider open-world applications such as the metaverse, social media, cosmetics, and forensics. AnyFace introduces a novel two-stream framework for face image synthesis and manipulation given arbitrary descriptions of a human face. Specifically, one stream performs text-to-face generation while the other conducts face image reconstruction. Facial text and image features are extracted with CLIP (Contrastive Language-Image Pre-training) encoders, and a collaborative Cross Modal Distillation (CMD) module is designed to align the linguistic and visual features across the two streams. Furthermore, a Diverse Triplet Loss (DT loss) is developed to model fine-grained features and improve facial diversity. Extensive experiments on Multi-Modal CelebA-HQ and CelebAText-HQ demonstrate significant advantages of AnyFace over state-of-the-art methods. AnyFace achieves high-quality, high-resolution, and high-diversity face synthesis and manipulation without any constraints on the number or content of input captions.
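To make the two-stream idea in the abstract concrete, the following is a minimal PyTorch sketch: CLIP encoders extract caption and face features for the two streams, a simple cosine-distillation term stands in for the paper's CMD alignment, and torch's plain TripletMarginLoss stands in for the Diverse Triplet Loss. All function names, margins, and the choice of CLIP backbone are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; the real CMD module and DT loss are defined in
# the AnyFace paper, and this stand-in uses cosine distillation plus
# torch.nn.TripletMarginLoss in their place.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()  # CLIP encoders serve as frozen feature extractors

@torch.no_grad()
def extract_features(captions, face_images):
    """Stream 1 encodes free-style captions; stream 2 encodes face images."""
    text_tokens = clip.tokenize(captions).to(device)
    text_feat = clip_model.encode_text(text_tokens).float()
    image_feat = clip_model.encode_image(face_images).float()
    return F.normalize(text_feat, dim=-1), F.normalize(image_feat, dim=-1)

def cross_modal_distillation(text_feat, image_feat):
    # Stand-in for the CMD module: pull the linguistic features of the
    # generation stream toward the visual features of the reconstruction
    # stream via cosine distance.
    return (1.0 - F.cosine_similarity(text_feat, image_feat, dim=-1)).mean()

# Stand-in for the Diverse Triplet Loss: an anchor caption should lie
# closer to its matching face than to a mismatched one, which separates
# fine-grained attributes and encourages facial diversity.
triplet = torch.nn.TripletMarginLoss(margin=0.2)  # margin is an assumption

def dt_loss_stand_in(anchor_text_feat, positive_img_feat, negative_img_feat):
    return triplet(anchor_text_feat, positive_img_feat, negative_img_feat)
```

In training, both terms would be summed with the generator's usual synthesis objectives, so that captions outside the training vocabulary still land near plausible face features in CLIP's shared embedding space.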