Paper title
Distortion-controlled Training for End-to-end Reverberant Speech Separation with Auxiliary Autoencoding Loss
Paper authors
Abstract
Recent progress in end-to-end neural network architectures has significantly advanced the performance of speech enhancement and separation systems in anechoic environments. However, the performance of such systems in reverberant environments is yet to be explored. A core problem in reverberant speech separation concerns the training and evaluation metrics. Standard time-domain metrics may introduce unexpected distortions during training and fail to properly evaluate the separation performance because of the reverberation. In this paper, we first introduce the "equal-valued contour" problem in reverberant separation, where multiple outputs can lead to the same performance as measured by the common metrics. We then investigate how "better" outputs with lower target-specific distortions can be selected by auxiliary autoencoding training (A2T). A2T assumes that the separation is done by a linear operation on the mixture signal, and it adds a loss term on the autoencoding of the direct-path target signals to ensure that the distortion introduced on the direct-path signals is controlled during separation. Evaluations on separation signal quality and speech recognition accuracy show that A2T is able to control the distortion on the direct-path signals and improve the recognition accuracy.
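The A2T idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several things are assumptions made purely for illustration: the linear separation operation is represented as a plain matrix, both terms are scored with a simple SNR (the abstract does not say which metric is used), and the names `snr_db`, `a2t_loss`, and `weight` are hypothetical.

```python
import numpy as np


def snr_db(estimate, reference, eps=1e-8):
    """Signal-to-noise ratio in dB of an estimate against its reference."""
    noise = estimate - reference
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(noise ** 2) + eps))


def a2t_loss(operator, mixture, target, direct_path, weight=1.0):
    """Separation loss plus an auxiliary autoencoding term (assumed form).

    operator:    (T, T) matrix standing in for the linear separation operation
    mixture:     (T,) observed mixture signal
    target:      (T,) training target for the separated output
    direct_path: (T,) direct-path component of the target speaker
    weight:      hypothetical scale on the auxiliary term
    """
    separated = operator @ mixture        # usual separation output
    autoencoded = operator @ direct_path  # same operator applied to the direct path
    separation_loss = -snr_db(separated, target)
    # Penalize distortion that the operator introduces on the direct-path signal.
    autoencoding_loss = -snr_db(autoencoded, direct_path)
    return separation_loss + weight * autoencoding_loss


# Toy signals: direct path + synthetic "reverberation" + an interfering source.
rng = np.random.default_rng(0)
direct_path = rng.standard_normal(16)
target = direct_path + 0.3 * rng.standard_normal(16)
mixture = target + rng.standard_normal(16)

operator = 0.5 * np.eye(16)  # a deliberately distorting operator (halves the signal)
total = a2t_loss(operator, mixture, target, direct_path)
```

In this toy setup the auxiliary term depends only on how the operator treats the direct-path signal, so two operators that score equally on the separation term can still be told apart by how much they distort the direct path.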