Paper Title
Separate What You Describe: Language-Queried Audio Source Separation
Authors
Abstract
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query describing the target source (e.g., "a man tells a joke followed by people laughing"). A unique challenge in LASS is the complexity of natural language descriptions and their relation to the audio sources. To address this issue, we propose LASS-Net, an end-to-end neural network trained to jointly process acoustic and linguistic information and to separate from an audio mixture the target source consistent with the language query. We evaluate the performance of our proposed system with a dataset created from the AudioCaps dataset. Experimental results show that LASS-Net achieves considerable improvements over baseline methods. Furthermore, we observe that LASS-Net achieves promising generalization results when using diverse human-annotated descriptions as queries, indicating its potential use in real-world scenarios. The separated audio samples and source code are available at https://liuxubo717.github.io/LASS-demopage.
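To make the task setup concrete, below is a minimal sketch of a query-conditioned separator in PyTorch. It is not the authors' LASS-Net architecture; the model name, layer sizes, and the choice of a toy GRU text encoder and concatenation-based conditioning are illustrative assumptions. It only shows the general pattern the abstract describes: encode the language query, encode the mixture, and predict a time-frequency mask conditioned on the query.

```python
import torch
import torch.nn as nn

class QueryConditionedSeparator(nn.Module):
    """Illustrative sketch (NOT the paper's exact LASS-Net): a text
    encoder embeds the query, an audio encoder embeds the mixture
    spectrogram, and a head predicts a query-conditioned mask."""

    def __init__(self, vocab_size=10000, embed_dim=128, n_mels=64):
        super().__init__()
        # Toy text encoder; a real system could use a pretrained
        # language model instead (hypothetical simplification here).
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Audio encoder over mel-spectrogram frames.
        self.audio_enc = nn.GRU(n_mels, embed_dim, batch_first=True)
        # Mask head conditioned on the query embedding via
        # concatenation; FiLM or attention are common alternatives.
        self.mask_head = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, n_mels),
            nn.Sigmoid(),  # per-bin mask in [0, 1]
        )

    def forward(self, mixture_mel, query_tokens):
        # mixture_mel: (batch, time, n_mels); query_tokens: (batch, seq)
        _, q = self.text_rnn(self.token_embed(query_tokens))
        q = q[-1]                           # (batch, embed_dim)
        a, _ = self.audio_enc(mixture_mel)  # (batch, time, embed_dim)
        q = q.unsqueeze(1).expand(-1, a.size(1), -1)
        mask = self.mask_head(torch.cat([a, q], dim=-1))
        return mask * mixture_mel           # masked spectrogram

# Usage: separate a short mixture given a tokenized query.
model = QueryConditionedSeparator()
mel = torch.randn(2, 300, 64)               # (batch, frames, mel bins)
tokens = torch.randint(0, 10000, (2, 12))   # dummy token ids
separated = model(mel, tokens)
print(separated.shape)  # torch.Size([2, 300, 64])
```

In practice such a mask would be applied to the mixture's STFT magnitude and inverted back to a waveform; the mel-domain masking here is only to keep the sketch compact.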