Paper title
Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis
Paper authors
Abstract
This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer level representations on coarser level ones. Additionally, it imposes hierarchical conditioning across all latent dimensions using a conditional variational auto-encoder (VAE) with an auto-regressive structure. Evaluation of reconstruction performance illustrates that the new structure does not degrade the model while allowing better interpretability. Interpretations of prosody attributes are provided together with the comparison between word-level and phone-level prosody representations. Moreover, both qualitative and quantitative evaluations are used to demonstrate the improvement in the disentanglement of the latent dimensions.
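The conditioning order described in the abstract can be illustrated with a toy sampler. This is a minimal sketch and not the authors' implementation: the function names `sample_level` and `sample_prosody`, the latent size, and the linear mean function are all hypothetical stand-ins for what would be learned networks inside a Tacotron 2 based conditional VAE. It only shows the dependency structure: word-level latents are drawn first, phone-level latents are conditioned on them, and within each level every dimension is conditioned on the dimensions drawn before it.

```python
import random


def sample_level(n_dims, context, rng, weight=0.5):
    """Sample one level's latent vector, one dimension at a time.

    Each dimension's mean depends on the coarser-level context and on the
    dimensions already sampled at this level (auto-regressive structure).
    The averaged linear mean is a toy assumption, not a learned network.
    """
    z = []
    for _ in range(n_dims):
        conditioning = list(context) + z
        if conditioning:
            mean = weight * sum(conditioning) / len(conditioning)
        else:
            mean = 0.0  # first dimension of the coarsest level has no context
        z.append(rng.gauss(mean, 1.0))
    return z


def sample_prosody(phones_per_word, n_dims=3, seed=0):
    """Sample word-level latents, then phone-level latents conditioned on them.

    phones_per_word: number of phones in each word of the utterance.
    Returns a list of (word_latent, [phone_latent, ...]) pairs.
    """
    rng = random.Random(seed)
    result = []
    for n_phones in phones_per_word:
        word_z = sample_level(n_dims, [], rng)  # coarser level
        phone_zs = [sample_level(n_dims, word_z, rng)  # finer level
                    for _ in range(n_phones)]
        result.append((word_z, phone_zs))
    return result


if __name__ == "__main__":
    # Two words with 2 and 3 phones respectively.
    for word_z, phone_zs in sample_prosody([2, 3]):
        print("word :", [round(v, 2) for v in word_z])
        for pz in phone_zs:
            print("  phone:", [round(v, 2) for v in pz])
```

In a trained model the latents would be inferred from reference speech rather than drawn from a toy prior. The ordering of dependencies is the point of the sketch.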