Title
Neural PLDA Modeling for End-to-End Speaker Verification
Authors
Abstract
While deep learning models have made significant advances in supervised classification problems, their application to out-of-set verification tasks such as speaker recognition has been limited to deriving feature embeddings. State-of-the-art x-vector PLDA-based speaker verification systems use a generative model based on probabilistic linear discriminant analysis (PLDA) to compute the verification score. Recently, we proposed a neural network approach for backend modeling in speaker verification called neural PLDA (NPLDA), in which the likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function and the learnable parameters of the score function are optimized using a verification cost. In this paper, we extend this work to jointly optimize the embedding neural network (x-vector network) and the NPLDA network in an end-to-end (E2E) fashion. The proposed E2E model is optimized directly from the acoustic features with a verification cost function, and during testing it directly outputs the likelihood ratio score. In experiments on the NIST speaker recognition evaluation (SRE) 2018 and 2019 datasets, we show that the proposed E2E model improves significantly over the x-vector PLDA baseline speaker verification system.
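To make the idea of a PLDA likelihood ratio used as a discriminative similarity function more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the common closed form in which the PLDA log-likelihood ratio for an enrollment/test embedding pair is a quadratic function with two matrices, here called `P` and `Q`, and it assumes a sigmoid-smoothed detection cost as one possible verification cost. The function names, the smoothing factor `alpha`, and the weight `beta` are illustrative assumptions that do not appear in the abstract.

```python
import numpy as np


def nplda_score(enroll, test, P, Q):
    """Pairwise quadratic score e'Qe + t'Qt + 2 e'Pt (constant term omitted).

    enroll, test: arrays of shape (n_trials, dim), one embedding pair per row.
    P, Q: (dim, dim) matrices; in a trainable model these would be the
    learnable parameters (assumed names, for illustration only).
    """
    return (
        np.einsum("nd,de,ne->n", enroll, Q, enroll)
        + np.einsum("nd,de,ne->n", test, Q, test)
        + 2.0 * np.einsum("nd,de,ne->n", enroll, P, test)
    )


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def soft_detection_cost(scores, labels, threshold, alpha=15.0, beta=99.0):
    """Smooth surrogate of a normalized detection cost (assumed form).

    labels: 1 for target trials, 0 for non-target trials.
    alpha: sharpness of the sigmoid that replaces the hard threshold.
    beta: relative weight of false alarms versus misses.
    """
    labels = labels.astype(float)
    # Soft count of target trials scored below the threshold.
    p_miss = np.sum(labels * sigmoid(alpha * (threshold - scores))) / np.sum(labels)
    # Soft count of non-target trials scored above the threshold.
    p_fa = np.sum((1.0 - labels) * sigmoid(alpha * (scores - threshold))) / np.sum(1.0 - labels)
    return p_miss + beta * p_fa
```

Because the score and the smoothed cost are both differentiable in `P`, `Q`, and the embeddings, a cost of this kind could in principle be backpropagated through the scoring function into an embedding network, which is the sense in which the abstract describes end-to-end optimization.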