Paper Title
I Hear Your True Colors: Image Guided Audio Generation
Paper Authors
Paper Abstract
We propose Im2Wav, an image-guided open-domain audio generation system. Given an input image or a sequence of images, Im2Wav generates a semantically relevant sound. Im2Wav is based on two Transformer language models that operate over a hierarchical discrete audio representation obtained from a VQ-VAE based model. We first produce a low-level audio representation using a language model. Then, we upsample the audio tokens using an additional language model to generate a high-fidelity audio sample. We use the rich semantics of a pre-trained CLIP (Contrastive Language-Image Pre-training) embedding as a visual representation to condition the language model. In addition, to steer the generation process towards the conditioning image, we apply the classifier-free guidance method. Results suggest that Im2Wav significantly outperforms the evaluated baselines in both fidelity and relevance evaluation metrics. Additionally, we provide an ablation study to better assess the impact of each of the method components on overall performance. Lastly, to better evaluate image-to-audio models, we propose an out-of-domain image dataset, denoted as ImageHear. ImageHear can be used as a benchmark for evaluating future image-to-audio models. Samples and code can be found in the manuscript.
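The abstract describes an autoregressive language model over discrete audio tokens, conditioned on a CLIP image embedding and steered by classifier-free guidance. Below is a minimal, hypothetical PyTorch sketch of how such guidance can be combined at the logit level during sampling; the `DummyTokenLM`, its 512-dimensional CLIP input, and the guidance scale of 3.0 are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DummyTokenLM(nn.Module):
    """Stand-in for an audio-token Transformer; shapes are illustrative only."""

    def __init__(self, vocab_size=2048, dim=64, clip_dim=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)   # discrete audio tokens from the VQ-VAE
        self.vis = nn.Linear(clip_dim, dim)        # project the CLIP image embedding
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, audio_tokens, visual_condition):
        h = self.tok(audio_tokens) + self.vis(visual_condition).unsqueeze(1)
        return self.head(h)[:, -1]                 # logits for the next audio token


@torch.no_grad()
def cfg_next_token_logits(lm, audio_tokens, clip_embed, null_embed, guidance_scale=3.0):
    """Classifier-free guidance: l = l_uncond + s * (l_cond - l_uncond)."""
    cond_logits = lm(audio_tokens, visual_condition=clip_embed)    # image-conditioned pass
    uncond_logits = lm(audio_tokens, visual_condition=null_embed)  # unconditional ("null") pass
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)


# Example usage with random inputs (shapes only, no trained weights):
lm = DummyTokenLM()
tokens = torch.randint(0, 2048, (1, 16))
clip_embed = torch.randn(1, 512)
null_embed = torch.zeros(1, 512)
logits = cfg_next_token_logits(lm, tokens, clip_embed, null_embed)
```

With a guidance scale above 1, the mixed logits push sampling toward tokens that the image-conditioned model prefers over the unconditional model, which is the role the abstract assigns to classifier-free guidance in steering generation toward the conditioning image.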