Paper title
Spatial and spectral deep attention fusion for multi-channel speech separation using deep embedding features
Authors
Abstract
Multi-channel deep clustering (MDC) has achieved good performance in speech separation. However, MDC uses spatial features only as additional input information, which makes it difficult to learn the mutual relationship between spatial and spectral features. In addition, the training objective of MDC is defined on the embedding vectors rather than on the actual separated sources, which may degrade separation performance. In this work, we propose a deep attention fusion method that dynamically controls the weights of the spectral and spatial features and combines them deeply. To address the training-objective problem of MDC, the actual separated sources are used as the training targets. Specifically, we apply a deep clustering network to extract deep embedding features. Instead of using unsupervised K-means clustering to estimate binary masks, a second, supervised network learns soft masks from these deep embedding features. Our experiments are conducted on a spatialized reverberant version of the WSJ0-2mix dataset. Experimental results show that the proposed method outperforms the MDC baseline and even surpasses the oracle ideal binary mask (IBM).
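The difference between an objective defined on embedding vectors and one defined on the separated sources can be made concrete with a minimal NumPy sketch. The first function is the standard deep clustering affinity loss; the second builds the oracle IBM from known source magnitudes. The third is only an assumption for illustration: the abstract does not state the mask activation or the loss, so a softmax over sources and a mean squared error on masked magnitudes are used here as stand-ins. All function names are hypothetical.

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Embedding-level objective ||V V^T - Y Y^T||_F^2.

    V: (N, D) embeddings, one per time-frequency bin.
    Y: (N, C) one-hot labels of the dominant source per bin.
    Expanded so that no N x N affinity matrix is formed.
    """
    return (np.sum((V.T @ V) ** 2)
            - 2.0 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))

def ideal_binary_mask(S):
    """Oracle IBM from source magnitudes S of shape (C, N)."""
    dominant = np.argmax(S, axis=0)          # dominant source per bin
    return np.eye(S.shape[0])[:, dominant]   # (C, N) one-hot masks

def soft_mask_loss(logits, X, S):
    """Signal-level objective (assumed form, not taken from the paper).

    logits: (C, N) mask scores, X: (N,) mixture magnitude,
    S: (C, N) source magnitudes. Returns (loss, soft masks).
    """
    z = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    M = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    return np.mean((M * X[None, :] - S) ** 2), M
```

The affinity loss only asks that bins of the same source have similar embeddings, whereas the signal-level loss measures the error of the estimated sources directly, which is the motivation the abstract gives for training on the separated sources.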