Evaluating the reliability of acoustic speech embeddings

Paper title

Evaluating the reliability of acoustic speech embeddings

Authors

Robin Algayres, Mohamed Salah Zaiem, Benoit Sagot, Emmanuel Dupoux

Abstract

Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimise the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, siamese). Then we use the ABX and MAP to predict performances on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that overall, ABX and MAP correlate with one another and with frequency estimation. However, substantial discrepancies appear in the fine-grained distinctions across languages and/or embedding methods. This makes it unrealistic at present to propose a task-independent silver bullet method for computing the intrinsic quality of speech embeddings. There is a need for more detailed analysis of the metrics currently used to evaluate such embeddings.
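To make the two metrics named in the abstract concrete, here is a minimal sketch of how an ABX error rate and a Mean Average Precision could be computed over a small set of labelled embeddings. It is an illustration under stated assumptions, not the authors' evaluation protocol: the cosine distance, the tie handling, the function names (`cosine_distances`, `abx_error`, `mean_average_precision`) and the toy data are all choices made here for the example.

```python
import numpy as np


def cosine_distances(emb):
    """Pairwise cosine distance matrix for row-vector embeddings (assumed metric)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def abx_error(emb, labels):
    """Fraction of (A, B, X) triplets in which X is not closer to A than to B.

    A and X share a label, B has a different label.
    Ties count as half an error (an assumption of this sketch).
    """
    dist = cosine_distances(emb)
    labels = np.asarray(labels)
    n = len(labels)
    errors, total = 0.0, 0
    for x in range(n):
        for a in range(n):
            if a == x or labels[a] != labels[x]:
                continue
            for b in range(n):
                if labels[b] == labels[x]:
                    continue
                total += 1
                if dist[a, x] > dist[b, x]:
                    errors += 1.0
                elif dist[a, x] == dist[b, x]:
                    errors += 0.5
    return errors / total


def mean_average_precision(emb, labels):
    """Retrieval-style MAP: each item queries all others, same label = relevant."""
    dist = cosine_distances(emb)
    labels = np.asarray(labels)
    average_precisions = []
    for q in range(len(labels)):
        others = np.delete(np.arange(len(labels)), q)
        ranked = others[np.argsort(dist[q, others], kind="stable")]
        relevant = (labels[ranked] == labels[q]).astype(float)
        if relevant.sum() == 0:
            continue  # a query with no same-label item has no defined AP
        precision_at_k = np.cumsum(relevant) / np.arange(1, len(ranked) + 1)
        average_precisions.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(average_precisions))


# Hypothetical toy data: two well-separated "word types" in a 2-D embedding space.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["a", "a", "b", "b"]

print("ABX error:", abx_error(emb, labels))
print("MAP:", mean_average_precision(emb, labels))
```

On this toy set the two clusters are perfectly separated, so the ABX error is 0 and the MAP is 1. The abstract's point is that on real embeddings the two scores agree overall but can rank languages or embedding methods differently at a fine-grained level.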
