Paper title
Vector-Quantized Timbre Representation
Paper authors
Paper abstract
Timbre is a set of perceptual attributes that identifies different types of sound sources. Although its definition is usually elusive, it can be seen from a signal processing viewpoint as all the spectral features that are perceived independently from pitch and loudness. Some works have studied high-level timbre synthesis by analyzing the feature relationships of different instruments, but acoustic properties remain entangled and generation remains bound to individual sounds. This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features. We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution. Timbre transfer can be performed by encoding any variable-length input signal into quantized latent features that are then decoded according to the learned timbre. We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments as an intuitive modality to drive sound synthesis. Furthermore, we can map the discrete latent space to acoustic descriptors and directly perform descriptor-based synthesis.
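To make the idea of a discrete latent space concrete, the following is a minimal sketch of nearest-neighbour vector quantization, the generic operation that maps each encoded frame to an entry of a learned codebook. It is not the authors' implementation: the function names (`squared_distance`, `quantize`), the two-dimensional toy codebook, and the frame values are all illustrative assumptions, and the sketch leaves out the encoder, the decoder, and the loudness disentanglement described in the abstract.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Map each latent frame to its nearest codebook vector.

    Returns the chosen codebook indices and the quantized frames.
    The input may contain any number of frames (variable length).
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Pick the codebook entry with the smallest distance to this frame.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Hypothetical toy values: a codebook of three 2-D vectors and four
# "encoder output" frames. A real model would learn the codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.4, 0.4]]

indices, quantized = quantize(frames, codebook)
# indices is [0, 1, 2, 0]
```

Because the quantizer works frame by frame, the same function handles inputs of any length, which is the property the abstract relies on for transferring variable-length signals. In a trained model, a decoder would then turn the sequence of codebook vectors back into audio in the learned timbre.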