Paper Title

Contrastive Audio-Language Learning for Music

Paper Authors

Ilaria Manco, Emmanouil Benetos, Elio Quinton, György Fazekas

Paper Abstract

As one of the most intuitive interfaces known to humans, natural language has the potential to mediate many tasks that involve human-computer interaction, especially in application-focused fields like Music Information Retrieval. In this work, we explore cross-modal learning in an attempt to bridge audio and language in the music domain. To this end, we propose MusCALL, a framework for Music Contrastive Audio-Language Learning. Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences, producing multimodal embeddings that can be used for text-to-audio and audio-to-text retrieval out-of-the-box. Thanks to this property, MusCALL can be transferred to virtually any task that can be cast as text-based retrieval. Our experiments show that our method performs significantly better than the baselines at retrieving audio that matches a textual description and, conversely, text that matches an audio query. We also demonstrate that the multimodal alignment capability of our model can be successfully extended to the zero-shot transfer scenario for genre classification and auto-tagging on two public datasets.
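
The dual-encoder setup described in the abstract follows the same pattern as CLIP-style image-text models: each modality is encoded separately, and a symmetric contrastive objective pulls matching audio-text pairs together in a shared embedding space. Below is a minimal sketch of such a symmetric InfoNCE loss, assuming PyTorch; the function name, shapes, and temperature value are illustrative assumptions, not MusCALL's actual implementation.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) objective
# for paired audio and text embeddings. Names and defaults are illustrative,
# not taken from the MusCALL codebase.
import torch
import torch.nn.functional as F


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) outputs of the two encoders.

    The i-th audio clip is paired with the i-th sentence, so matching
    pairs lie on the diagonal of the batch similarity matrix.
    """
    # L2-normalise so dot products are cosine similarities
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by temperature
    logits = audio_emb @ text_emb.t() / temperature

    # Targets: each row/column should match its own index
    targets = torch.arange(logits.size(0), device=logits.device)

    # Audio-to-text and text-to-audio cross-entropy, averaged
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2
```

Under this kind of objective, the zero-shot transfer mentioned in the abstract reduces to retrieval: embed a textual prompt for each candidate label (e.g. each genre name or tag), embed the audio query, and pick the label whose prompt embedding has the highest cosine similarity to the audio embedding.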
