Paper Title

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

Authors

Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, Dorien Herremans

Abstract


Most state-of-the-art automatic music transcription (AMT) models break down the main transcription task into sub-tasks, such as onset prediction and offset prediction, and train them with onset and offset labels. These predictions are then concatenated together and used as the input to train another model with the pitch labels to obtain the final transcription. We attempt to use only the pitch labels (together with a spectrogram reconstruction loss) and explore how far this model can go without introducing supervised sub-tasks. In this paper, we do not aim at achieving state-of-the-art transcription accuracy; instead, we explore the effect that spectrogram reconstruction has on our AMT model. Our proposed model consists of two U-Nets: the first U-Net transcribes the spectrogram into a posteriorgram, and the second U-Net transforms the posteriorgram back into a spectrogram. A reconstruction loss is applied between the original spectrogram and the reconstructed spectrogram to constrain the second U-Net to focus only on reconstruction. We train our model on three different datasets: MAPS, MAESTRO, and MusicNet. Our experiments show that adding the reconstruction loss generally improves the note-level transcription accuracy compared to the same model without the reconstruction part. Moreover, it also boosts the frame-level precision above that of state-of-the-art models. The feature maps learned by our U-Net contain grid-like structures (not present in the baseline model), which implies that, in the presence of the reconstruction loss, the model is probably trying to count along both the time and frequency axes, resulting in higher note-level transcription accuracy.
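The training objective described in the abstract — pitch-label supervision on the posteriorgram plus a reconstruction term between the input and reconstructed spectrograms — can be sketched as a combined loss. The following is a minimal NumPy illustration, assuming binary cross-entropy for the transcription term and mean squared error for the reconstruction term; the two U-Net forward passes are abstracted away, and the function names, array shapes, and `alpha` weighting are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def transcription_loss(posteriorgram, pitch_labels, eps=1e-7):
    """Binary cross-entropy between the predicted posteriorgram
    (frame x pitch activations in [0, 1]) and the binary pitch labels."""
    p = np.clip(posteriorgram, eps, 1 - eps)  # avoid log(0)
    return -np.mean(pitch_labels * np.log(p)
                    + (1 - pitch_labels) * np.log(1 - p))

def reconstruction_loss(spectrogram, reconstructed):
    """Mean squared error between the input spectrogram and the
    spectrogram the second U-Net reconstructs from the posteriorgram."""
    return np.mean((spectrogram - reconstructed) ** 2)

def total_loss(posteriorgram, pitch_labels,
               spectrogram, reconstructed, alpha=1.0):
    """Combined objective: only pitch labels supervise the transcription;
    the reconstruction term acts as an auxiliary signal without
    requiring any onset/offset labels."""
    return (transcription_loss(posteriorgram, pitch_labels)
            + alpha * reconstruction_loss(spectrogram, reconstructed))
```

Setting `alpha = 0` recovers the baseline model without the reconstruction part, which is the comparison the experiments in the paper are built around.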
