Improving LPCNet-based Text-to-Speech with Linear Prediction-structured Mixture Density Network

Paper Title

Improving LPCNet-based Text-to-Speech with Linear Prediction-structured Mixture Density Network

Authors

Min-Jae Hwang, Eunwoo Song, Ryuichi Yamamoto, Frank Soong, Hong-Goo Kang

Abstract

In this paper, we propose an improved LPCNet vocoder using a linear prediction (LP)-structured mixture density network (MDN). The recently proposed LPCNet vocoder has successfully achieved high-quality and lightweight speech synthesis by combining a vocal tract LP filter with a WaveRNN-based vocal source (i.e., excitation) generator. However, the quality of synthesized speech is often unstable because the vocal source component is insufficiently represented by the mu-law quantization method, and the model is trained without considering the entire speech production mechanism. To address this problem, we first introduce LP-MDN, which enables the autoregressive neural vocoder to structurally represent the interactions between the vocal tract and vocal source components. Then, we propose to incorporate the LP-MDN into the LPCNet vocoder by replacing the conventional discretized output with a continuous density distribution. The experimental results verify that the proposed system provides high-quality synthetic speech by achieving a mean opinion score of 4.41 within a text-to-speech framework.
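
The abstract does not spell out the formulation, so the following is only an illustrative sketch and not the authors' implementation. It assumes 8-bit mu-law companding (mu = 255) for the discretized output, the standard LP decomposition s_t = p_t + e_t (speech sample = LP prediction + excitation), and, as an assumption about how an LP-structured mixture could work, that a mixture over the excitation becomes a mixture over the speech sample by shifting each component mean by p_t. All function names are hypothetical.

```python
import math

MU = 255  # assumed 8-bit mu-law resolution for the discretized output


def mulaw_encode(x, mu=MU):
    """Map a sample x in [-1, 1] to an integer class in [0, mu]."""
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((y + 1) / 2 * mu))


def mulaw_decode(q, mu=MU):
    """Map an integer class in [0, mu] back to a value in [-1, 1]."""
    y = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def lp_prediction(history, coeffs):
    """Linear prediction p_t = sum_k a_k * s_{t-k}; history[-1] is s_{t-1}."""
    return sum(a * history[-k] for k, a in enumerate(coeffs, start=1))


def shift_mixture_means(excitation_means, p_t):
    """Assumed LP-structured step: since s_t = p_t + e_t, a mixture over e_t
    becomes a mixture over s_t by adding p_t to every component mean
    (mixture weights and scales are left unchanged)."""
    return [m + p_t for m in excitation_means]


if __name__ == "__main__":
    x = 0.5
    # Quantizing and decoding loses precision, which a continuous density avoids.
    print("mu-law round-trip error:", abs(mulaw_decode(mulaw_encode(x)) - x))
    p_t = lp_prediction([0.1, 0.2, 0.3], [0.5, 0.25])
    print("speech-domain means:", shift_mixture_means([-0.1, 0.0, 0.1], p_t))
```

The round-trip error printed first illustrates the quantization loss that motivates a continuous output distribution; the second line shows how the LP prediction can be folded into the predicted density under the stated assumption.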
