Paper Title
Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models
Paper Authors
Paper Abstract
Novel architectures have recently improved generative image synthesis, leading to excellent visual quality in various tasks. Of particular note is the field of ``AI-Art'', which has seen unprecedented growth with the emergence of powerful multimodal models such as CLIP. By combining language and image synthesis models, so-called ``prompt engineering'' has become established, in which carefully selected and composed sentences are used to achieve a certain visual style in the synthesized image. In this note, we present an alternative approach based on retrieval-augmented diffusion models (RDMs). In RDMs, a set of nearest neighbors is retrieved from an external database during training for each training instance, and the diffusion model is conditioned on these informative samples. During inference (sampling), we replace the retrieval database with a more specialized database that contains, for example, only images of a particular visual style. This provides a novel way to prompt a generally trained model after training and thereby specify a particular visual style. As shown by our experiments, this approach is superior to specifying the visual style within the text prompt. We open-source code and model weights at https://github.com/CompVis/latent-diffusion.
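The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes images are already embedded into a shared vector space (e.g. CLIP-style embeddings) and shows only the nearest-neighbor lookup whose results an RDM would use as conditioning. The function name `retrieve_neighbors` and all array shapes are illustrative choices.

```python
import numpy as np

def retrieve_neighbors(query: np.ndarray, database: np.ndarray, k: int = 4) -> np.ndarray:
    """Return the k database entries most similar to `query` by cosine similarity.

    In an RDM, these neighbor embeddings would be passed to the diffusion
    model as conditioning (e.g. via cross-attention) instead of, or in
    addition to, a text prompt.
    """
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                    # cosine similarity of query to each entry
    top = np.argsort(-sims)[:k]      # indices of the k most similar entries
    return database[top]

# Illustrative usage with random stand-ins for real embeddings:
rng = np.random.default_rng(0)
db = rng.normal(size=(100, 512))     # stand-in for an embedded image database
query = rng.normal(size=512)         # stand-in for the embedded query
neighbors = retrieve_neighbors(query, db, k=4)
print(neighbors.shape)               # (4, 512)
```

The key property the paper exploits is that `database` is an argument, not a baked-in model component: swapping in a smaller, style-specific database at sampling time changes the conditioning, and hence the visual style of the output, without retraining the diffusion model.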