Paper Title
Audio representations for deep learning in sound synthesis: A review

Paper Authors

Natsiou, Anastasia, O'Leary, Sean

Abstract


The rise of deep learning algorithms has led many researchers to withdraw from using classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation, and the choice of architecture is tightly coupled to the audio representation. A sound's raw waveform can be too dense and rich for deep learning models to process efficiently: its complexity increases training time and computational cost, and it does not represent sound in the manner in which it is perceived. Therefore, in many cases, the raw audio is transformed into a compressed and more meaningful form through feature extraction or by adopting a higher-level description of the waveform. Furthermore, depending on the form chosen, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. Additionally, it presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models, always depending on the audio representation.
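The abstract's central point is that raw waveforms are often replaced by compressed, more perceptually meaningful representations before being fed to a model. A minimal numpy sketch of one common transform, the magnitude spectrogram, is shown below; the frame size and hop length are illustrative choices, not values taken from the paper.

```python
import numpy as np

def stft_magnitude(signal, frame_size=1024, hop=256):
    """Naive STFT magnitude: Hann-windowed frames -> rFFT bins."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_size] * window
                       for i in range(n_frames)])
    # One row per frame, one column per frequency bin
    return np.abs(np.fft.rfft(frames, axis=1))

# 1 second of a 440 Hz sine at 16 kHz stands in for raw audio
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)

spec = stft_magnitude(audio)
# 16000 raw samples become a (59, 513) time-frequency grid:
# far fewer time steps, with energy concentrated near the 440 Hz bin
print(audio.shape, spec.shape)
```

The time axis shrinks from 16,000 samples to 59 frames, which is one reason spectrogram-like inputs are easier for many architectures to handle than raw audio; the trade-off, discussed in the paper, is that phase is discarded and must be recovered at synthesis time.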