Title
T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model
Authors
Abstract
Identifying multiple speakers without knowing where each speaker's voice occurs in a recording is a challenging task. This paper proposes a hierarchical network with transformer encoders and a memory mechanism to address this problem. The proposed model contains a frame-level encoder and a segment-level encoder, both of which use the transformer encoder block. The multi-head attention mechanism in the transformer structure can better capture different speaker properties when the input utterance contains multiple speakers. The memory mechanism used in the frame-level encoder builds a recurrent connection that better captures long-term speaker features. The experiments are conducted on artificial datasets based on the Switchboard Cellular Part 1 (SWBC) and VoxCeleb1 datasets. In different data construction scenarios (Concat and Overlap), the proposed model outperforms four strong baselines, achieving 13.3% and 10.5% relative improvements over H-vectors and S-vectors, respectively. Using the memory mechanism yields 10.6% and 7.7% relative improvements over not using it.
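The hierarchical two-stage design described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it uses plain scaled dot-product self-attention without learned projections, mean pooling in place of any learned aggregation, and omits the memory mechanism entirely. The function names (`self_attention`, `t_vector`), the segment length, and the feature dimension are all assumptions made for the example.

```python
import numpy as np

def self_attention(x, n_heads=4):
    # Multi-head scaled dot-product self-attention over frames.
    # Simplification for illustration: no learned Q/K/V projections.
    d = x.shape[-1]
    dh = d // n_heads
    out = np.empty_like(x)
    for h in range(n_heads):
        q = k = v = x[:, h * dh:(h + 1) * dh]
        scores = q @ k.T / np.sqrt(dh)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # softmax over frames
        out[:, h * dh:(h + 1) * dh] = w @ v
    return out

def t_vector(frames, seg_len=10):
    # Frame-level stage: encode each segment, mean-pool to a segment embedding.
    segs = [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]
    seg_embs = np.stack([self_attention(s).mean(axis=0) for s in segs])
    # Segment-level stage: encode the segment sequence, pool to one utterance vector.
    return self_attention(seg_embs).mean(axis=0)

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 32))  # 50 frames of 32-dim acoustic features
emb = t_vector(frames)
print(emb.shape)  # → (32,)
```

In the full model this utterance-level vector would feed a multi-label speaker classifier, and the frame-level encoder would additionally carry memory states across segments to form the recurrent connection the abstract mentions.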