Paper Title

Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral Mapping for Single-channel Speech Enhancement

Authors

Kuan-Lin Chen, Daniel D. E. Wong, Ke Tan, Buye Xu, Anurag Kumar, Vamsi Krishna Ithapu

Abstract

Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin. Due to unrestricted heteroscedastic uncertainty, the covariance introduces an undersampling effect, detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component with its uncertainty, effectively compensating severely undersampled components with more penalties. Our multivariate setting reveals common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular loss functions including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR).
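The diagonal-covariance case of the objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `heteroscedastic_nll` and the `var_floor` parameter are hypothetical, with `var_floor` standing in for the paper's inflated uncertainty lower bound.

```python
import numpy as np

def heteroscedastic_nll(error, log_var, var_floor=1e-3):
    """Diagonal-covariance Gaussian NLL averaged over time-frequency bins.

    error:     (T, F, 2) real/imaginary enhancement error per T-F bin
    log_var:   (T, F, 2) predicted log-variance (from a temporary submodel)
    var_floor: lower bound on the variance; inflating it mitigates the
               undersampling effect of unrestricted heteroscedastic uncertainty
    """
    var = np.maximum(np.exp(log_var), var_floor)  # clamp variance from below
    # Each squared error is weighted by its inverse variance, so severely
    # undersampled (high-variance) components still pay a log-variance penalty.
    return 0.5 * np.mean(np.log(var) + error**2 / var)
```

With `log_var` fixed at zero this reduces (up to constants) to the MSE, i.e., the identity-covariance special case that the multivariate setting generalizes.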
