An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales

Title

An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales

Authors

Gu, Bin; Guo, Wu

Abstract

This paper presents an improved deep embedding learning method based on convolutional neural networks (CNNs) for text-independent speaker verification. Two improvements to x-vector embedding learning are proposed: (1) multi-scale convolution (MSCNN) is adopted in the frame-level layers to capture complementary speaker information from different receptive fields; (2) a Baum-Welch statistics attention (BWSA) mechanism is applied in the temporal pooling layer, allowing it to integrate more useful long-term speaker characteristics. Experiments are carried out on the NIST SRE16 evaluation set. The results demonstrate the effectiveness of MSCNN and show that the proposed BWSA can further improve the performance of the DNN embedding system.
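The abstract names the two components without giving their equations. As a rough illustration only, the pure-Python sketch below shows (a) a multi-scale temporal convolution, where parallel branches with different kernel widths see different receptive fields and their outputs are concatenated per frame, and (b) zeroth- and first-order Baum-Welch statistics pooled into a fixed-length utterance vector. It is not the authors' MSCNN or BWSA. The shared depthwise kernels, the softmax-over-distances posteriors standing in for GMM posteriors, the per-component normalization, and the function names (`conv1d_same`, `multi_scale_conv`, `baum_welch_stats`, `pool_utterance`) are all hypothetical, and the attention step that BWSA builds on these statistics is not reproduced.

```python
import math
from typing import List

Matrix = List[List[float]]  # T frames x D features


def conv1d_same(x: Matrix, kernel: List[float]) -> Matrix:
    """Depthwise 1-D convolution along time with zero 'same' padding.

    Hypothetical simplification: one scalar kernel is shared by every
    feature dimension, unlike a real learned convolutional layer.
    """
    k = len(kernel)
    pad = k // 2  # odd kernel widths keep the output length equal to T
    T, D = len(x), len(x[0])
    out = []
    for t in range(T):
        row = []
        for d in range(D):
            acc = 0.0
            for j in range(k):
                src = t + j - pad
                if 0 <= src < T:  # zero padding outside the utterance
                    acc += kernel[j] * x[src][d]
            row.append(acc)
        out.append(row)
    return out


def multi_scale_conv(x: Matrix, kernels: List[List[float]]) -> Matrix:
    """One branch per kernel width; branch outputs are concatenated per frame."""
    branches = [conv1d_same(x, k) for k in kernels]
    return [sum((b[t] for b in branches), []) for t in range(len(x))]


def baum_welch_stats(x: Matrix, centers: Matrix):
    """Zeroth-order N_c and centered first-order F_c statistics.

    Assumption: posteriors gamma[t][c] = softmax_c(-||x_t - mu_c||^2)
    stand in for the GMM posteriors of a classic i-vector system.
    """
    C, D = len(centers), len(x[0])
    N = [0.0] * C
    F = [[0.0] * D for _ in range(C)]
    for frame in x:
        scores = [-sum((a - b) ** 2 for a, b in zip(frame, mu)) for mu in centers]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        for c in range(C):
            g = exps[c] / z
            N[c] += g
            for d in range(D):
                F[c][d] += g * (frame[d] - centers[c][d])
    return N, F


def pool_utterance(x: Matrix, centers: Matrix) -> List[float]:
    """Fixed-length vector: per-component F_c / N_c, concatenated over c."""
    N, F = baum_welch_stats(x, centers)
    return [f / max(n, 1e-8) for n, Fc in zip(N, F) for f in Fc]


# Usage: 4 frames of 2-D features, kernel widths 1 and 3, two components.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
h = multi_scale_conv(frames, [[1.0], [1 / 3, 1 / 3, 1 / 3]])
emb = pool_utterance(h, centers=[[0.0] * 4, [1.0] * 4])
print(len(h), len(h[0]), len(emb))  # 4 4 8
```

The output length of the pooled vector depends only on the number of components and the feature width, not on the number of frames, which is what lets a pooling layer map variable-length utterances to a fixed-size embedding.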