Paper Title
Audio Language Modeling using Perceptually-Guided Discrete Representations
Paper Authors
Paper Abstract
In this work, we study the task of Audio Language Modeling, in which we aim to learn probabilistic models for audio that can be used for generation and completion. We use a state-of-the-art perceptually-guided audio compression model to encode audio into discrete representations. Next, we train a transformer-based causal language model on these representations. At inference time, we perform audio auto-completion by encoding an audio prompt as a discrete sequence, feeding it to the audio language model, sampling from the model, and synthesizing the corresponding time-domain signal. We evaluate the quality of samples generated by our method on AudioSet, the largest general-audio dataset to date, and show that it is superior to the evaluated baseline audio encoders. We additionally provide an extensive analysis to better understand the trade-off between audio quality and language-modeling capabilities. Samples: link.
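The auto-completion pipeline described in the abstract (encode prompt to discrete tokens, sample a continuation from a causal LM, decode back to a waveform) can be sketched as below. This is a minimal illustration with stub components, not the authors' actual codec or model: `encode`, `decode`, and `lm_next_token_distribution` are hypothetical stand-ins, and `VOCAB_SIZE` is an assumed codebook size.

```python
import random

VOCAB_SIZE = 1024  # assumed codebook size of the discrete audio representation

def encode(audio):
    """Stub codec encoder: map audio samples to discrete token ids."""
    return [hash(round(x, 3)) % VOCAB_SIZE for x in audio]

def decode(tokens):
    """Stub codec decoder: map token ids back to a time-domain signal."""
    return [t / VOCAB_SIZE for t in tokens]

def lm_next_token_distribution(prefix):
    """Stub causal LM: a distribution over the next token, conditioned
    on recent context (a trained transformer in the real system)."""
    rng = random.Random(sum(prefix[-4:]))
    probs = [rng.random() for _ in range(VOCAB_SIZE)]
    total = sum(probs)
    return [p / total for p in probs]

def autocomplete(prompt_audio, n_new_tokens, seed=0):
    """Audio auto-completion: encode the prompt, autoregressively sample
    a continuation token by token, then decode the full sequence."""
    rng = random.Random(seed)
    tokens = encode(prompt_audio)
    for _ in range(n_new_tokens):
        probs = lm_next_token_distribution(tokens)
        tokens.append(rng.choices(range(VOCAB_SIZE), weights=probs)[0])
    return decode(tokens)

completed = autocomplete([0.1, -0.2, 0.05], n_new_tokens=8)
print(len(completed))  # → 11 (3 prompt tokens + 8 sampled continuation tokens)
```

In the real system the decoder of the perceptually-guided compression model synthesizes the waveform; here `decode` only illustrates the interface.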