Coarse-to-Fine Recursive Speech Separation for Unknown Number of Speakers

Paper Title

Coarse-to-Fine Recursive Speech Separation for Unknown Number of Speakers

Authors

Zhenhao Jin, Xiang Hao, Xiangdong Su

Abstract

The vast majority of speech separation methods assume that the number of speakers is known in advance, so each model is specific to one speaker count. A more realistic and challenging task is to separate a mixture in which the number of speakers is unknown. This paper formulates speech separation with an unknown number of speakers as a multi-pass source extraction problem and proposes a coarse-to-fine recursive speech separation method. The method comprises two stages: recursive cue extraction and target speaker extraction. The recursive cue extraction stage monitors statistics of the mixture to determine how many iterations to perform, and outputs a coarse cue speech at each iteration. As the number of recursive iterations increases, distortion accumulates in both the extracted speech and the remainder. Therefore, in the second stage, a target speaker extraction network extracts a fine speech signal from the coarse target cue and the original, undistorted mixture. Experiments show that the proposed method achieves state-of-the-art performance on the WSJ0 dataset with different numbers of speakers. Furthermore, it generalizes well to a larger number of speakers than seen during training.
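
The two-stage control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `extract_cue` and `extract_target` are hypothetical callables standing in for the two networks, and the stopping rule (relative energy left in the remainder, with `energy_ratio_threshold` and `max_iterations`) is an assumed stand-in for the mixture statistic, which the abstract does not specify.

```python
from typing import Callable, List, Tuple

import numpy as np

# Hypothetical interfaces; neither signature is taken from the paper.
# Stage 1 network: residual mixture -> (coarse cue, new remainder)
CueExtractor = Callable[[np.ndarray], Tuple[np.ndarray, np.ndarray]]
# Stage 2 network: (original mixture, coarse cue) -> refined speech
TargetExtractor = Callable[[np.ndarray, np.ndarray], np.ndarray]


def coarse_to_fine_separate(
    mixture: np.ndarray,
    extract_cue: CueExtractor,
    extract_target: TargetExtractor,
    energy_ratio_threshold: float = 0.01,  # assumed stopping statistic
    max_iterations: int = 10,              # assumed safety cap
) -> List[np.ndarray]:
    """Return one refined signal per speaker found in `mixture`."""
    original_energy = float(np.mean(mixture ** 2)) + 1e-12
    remainder = mixture
    cues: List[np.ndarray] = []

    # Stage 1: peel off one coarse cue per pass until little energy remains.
    for _ in range(max_iterations):
        remaining_ratio = float(np.mean(remainder ** 2)) / original_energy
        if remaining_ratio < energy_ratio_threshold:
            break
        cue, remainder = extract_cue(remainder)
        cues.append(cue)

    # Stage 2: refine each cue against the ORIGINAL mixture, not the
    # remainder, so distortion accumulated in stage 1 is not inherited.
    return [extract_target(mixture, cue) for cue in cues]
```

The number of returned signals is the estimated speaker count, so the same loop applies unchanged to mixtures with any number of speakers up to `max_iterations`.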
