SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

Paper title

SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

Authors

Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract

In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset which is a common benchmark for building neural acoustic models and vocoders. Utterances are generated from 200 TTS systems including vanilla neural acoustic models as well as models which allow prosodic variations. An LPCNet vocoder is used for all systems, so that the samples' variation depends only on the acoustic models. The synthesized utterances provide balanced and adequate domain and length coverage. We collect MOS naturalness evaluations on 3 English Amazon Mechanical Turk locales and share practices leading to reliable crowdsourced annotations for this task. We provide baseline results of state-of-the-art MOS prediction models on the SOMOS dataset and show the limitations that such models face when assigned to evaluate TTS utterances.
