Paper Title

Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations

Paper Authors

Jaejin Cho, Raghavendra Pappagari, Piotr Żelasko, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak

Paper Abstract

Considering the abundance of unlabeled speech data and the high cost of labeling, unsupervised learning methods can be essential for better system development. Among the most successful are contrastive self-supervised methods, which require negative sampling: drawing alternative samples to contrast with the current sample (the anchor). Without labels, however, it is hard to ensure that all negative samples belong to classes different from the anchor's. This paper applies a non-contrastive self-supervised learning method to an unlabeled speech corpus to learn utterance-level embeddings. We used DIstillation with NO labels (DINO), originally proposed in computer vision, and adapted it to the speech domain. Unlike contrastive methods, DINO does not require negative sampling. The resulting embeddings were evaluated on speaker verification and emotion recognition. In speaker verification, the unsupervised DINO embedding with cosine scoring achieved 4.38% EER on the VoxCeleb1 test trials, outperforming the best contrastive self-supervised method by 40% relative in EER. An iterative pseudo-labeling training pipeline, requiring no speaker labels, further improved the EER to 1.89%. In emotion recognition, the DINO embedding achieved micro-F1 scores of 60.87%, 79.21%, and 56.98% on IEMOCAP, CREMA-D, and MSP-Podcast, respectively. These results suggest that the DINO embedding generalizes across different speech applications.
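Since the abstract only names the method, a rough sketch may help place it: DINO trains a student network to match the output distribution of a momentum-averaged teacher over differently augmented views of the same utterance, using teacher centering, temperature sharpening, and a stop-gradient instead of negative pairs. The PyTorch snippet below is a minimal illustration of this generic objective from the original computer-vision DINO paper, together with the cosine scoring used for verification; the temperatures, momentum value, and function names are illustrative assumptions, not the authors' exact speech recipe.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution; no negative samples are involved.

    student_out, teacher_out: (batch, K) projection-head logits for two
    augmented views of the same utterance. `center` is a running mean of
    teacher outputs that prevents collapse; t_s / t_t are illustrative
    temperature values.
    """
    teacher = F.softmax((teacher_out - center) / t_t, dim=-1).detach()  # stop-gradient
    log_student = F.log_softmax(student_out / t_s, dim=-1)
    return -(teacher * log_student).sum(dim=-1).mean()

@torch.no_grad()
def update_teacher(teacher, student, m=0.996):
    """Teacher weights are an exponential moving average of the student's."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)

def cosine_score(emb1, emb2):
    """Verification score between two utterance-level embeddings,
    as in the paper's cosine-scoring evaluation."""
    return F.cosine_similarity(emb1, emb2, dim=-1)
```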
