Paper Title
Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph
Paper Authors
Paper Abstract
This paper addresses the unsupervised learning of content-style decomposed representation. We first give a definition of style and then model the content-style representation as a token-level bipartite graph. An unsupervised framework, named Retriever, is proposed to learn such representations. First, a cross-attention module is employed to retrieve permutation invariant (P.I.) information, defined as style, from the input data. Second, a vector quantization (VQ) module is used, together with man-induced constraints, to produce interpretable content tokens. Last, an innovative link attention module serves as the decoder to reconstruct data from the decomposed content and style, with the help of the linking keys. Being modal-agnostic, the proposed Retriever is evaluated in both speech and image domains. The state-of-the-art zero-shot voice conversion performance confirms the disentangling ability of our framework. Top performance is also achieved in the part discovery task for images, verifying the interpretability of our representation. In addition, the vivid part-based style transfer quality demonstrates the potential of Retriever to support various fascinating generative tasks. Project page at https://ydcustc.github.io/retriever-demo/.
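To make the dataflow described in the abstract concrete, below is a minimal PyTorch sketch of the three components it names: a cross-attention style retriever with learned queries, a vector-quantization bottleneck that produces discrete content tokens, and a link-attention decoder that recombines content with style through learned linking keys. All module names, dimensions, and the precise linking-key mechanism here are illustrative assumptions, not the authors' implementation; consult the project page for the actual system.

```python
# Illustrative sketch only: the structure follows the abstract's description,
# but details (shapes, losses, linking-key mechanics) are assumptions.
import torch
import torch.nn as nn


class StyleRetriever(nn.Module):
    """Cross-attention with learned queries: pools permutation-invariant
    (style) information from a token sequence."""
    def __init__(self, dim, n_style_tokens=4, n_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_style_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats):                        # feats: (B, T, dim)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        style, _ = self.attn(q, feats, feats)        # (B, n_style_tokens, dim)
        return style


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator,
    yielding discrete content tokens."""
    def __init__(self, n_codes, dim):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):                            # z: (B, T, dim)
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        idx = dist.argmin(-1)                        # (B, T) discrete token ids
        z_q = self.codebook(idx)
        return z + (z_q - z).detach(), idx           # straight-through gradient


class LinkAttentionDecoder(nn.Module):
    """Recombines content and style: content tokens attend over style tokens,
    with learned 'linking keys' as the attention keys and the retrieved style
    tokens as the values (a rough stand-in for the paper's link attention)."""
    def __init__(self, dim, n_style_tokens=4, n_heads=4):
        super().__init__()
        self.link_keys = nn.Parameter(torch.randn(n_style_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, content, style):               # content: (B, T, dim)
        k = self.link_keys.unsqueeze(0).expand(content.size(0), -1, -1)
        fused, _ = self.attn(content, k, style)      # q=content, k=link keys, v=style
        return self.out(content + fused)


if __name__ == "__main__":
    B, T, D = 2, 50, 64
    feats = torch.randn(B, T, D)                     # encoder features of one sample
    style = StyleRetriever(D)(feats)                 # style pooled from a reference sample
    content, ids = VectorQuantizer(128, D)(feats)    # discrete content tokens
    recon = LinkAttentionDecoder(D)(content, style)  # swap `style` for zero-shot transfer
    print(recon.shape)                               # torch.Size([2, 50, 64])
```

In a zero-shot conversion or transfer scenario, the content tokens would come from the source sample and the style tokens from a reference sample, so replacing `style` in the final call is what realizes the transfer described in the abstract.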