Paper Title

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

Paper Authors

Yuchao Gu, Xintao Wang, Yixiao Ge, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Paper Abstract

Vector-Quantized (VQ-based) generative models usually consist of two basic components, i.e., VQ tokenizers and generative transformers. Prior research focuses on improving the reconstruction fidelity of VQ tokenizers but rarely examines how the improvement in reconstruction affects the generation ability of generative transformers. In this paper, we surprisingly find that improving the reconstruction fidelity of VQ tokenizers does not necessarily improve the generation. Instead, learning to compress semantic features within VQ tokenizers significantly improves generative transformers' ability to capture textures and structures. We thus highlight two competing objectives of VQ tokenizers for image synthesis: semantic compression and details preservation. Different from previous work that only pursues better details preservation, we propose Semantic-Quantized GAN (SeQ-GAN) with two learning phases to balance the two objectives. In the first phase, we propose a semantic-enhanced perceptual loss for better semantic compression. In the second phase, we fix the encoder and codebook, but enhance and finetune the decoder to achieve better details preservation. The proposed SeQ-GAN greatly improves VQ-based generative models and surpasses GANs and diffusion models on both unconditional and conditional image generation. Our SeQ-GAN (364M) achieves a Fréchet Inception Distance (FID) of 6.25 and an Inception Score (IS) of 140.9 on 256x256 ImageNet generation, a remarkable improvement over ViT-VQGAN (714M), which obtains 11.2 FID and 97.2 IS.
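The vector-quantization step at the heart of a VQ tokenizer maps each continuous encoder feature to its nearest entry in a learned codebook, producing the discrete tokens the generative transformer is trained on. Below is a minimal, illustrative NumPy sketch of that nearest-neighbor lookup; the function name, shapes, and random codebook are assumptions for the example, not the paper's implementation.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each continuous feature vector to its nearest codebook entry.

    z: (N, D) array of encoder features.
    codebook: (K, D) array of learned code vectors.
    Returns the quantized features (N, D) and the discrete token indices (N,).
    """
    # Squared Euclidean distance between every feature and every code,
    # computed via broadcasting: (N, 1, D) - (1, K, D) -> (N, K, D).
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)  # (N,) discrete token ids
    return codebook[indices], indices

# Toy example: 4 features, a codebook of 8 entries, dimension 16.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
codebook = rng.normal(size=(8, 16))
z_q, tokens = vector_quantize(z, codebook)
```

In a full tokenizer the codebook is learned jointly with the encoder/decoder (e.g. via a commitment loss and straight-through gradients); this sketch only shows the discrete lookup that turns image features into token indices.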
