Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis

Paper Title

Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis

Authors

Pengyu Cheng, Zhenhua Ling

Abstract

In this paper, we propose a speaker adaptation method that uses intuitive prosodic features for statistical parametric speech synthesis. The features employed are pitch, pitch range, speech rate, and energy, chosen because they relate directly to the overall prosodic characteristics of different speakers. They are extracted at the utterance level or the speaker level and integrated into the existing speaker-encoding-based and speaker-embedding-based adaptation frameworks, respectively. The acoustic models are sequence-to-sequence models based on Tacotron2. The intuitive prosodic features are concatenated with the text encoder outputs and the speaker vectors for decoding acoustic features. Experimental results demonstrate that the proposed methods achieve better objective and subjective performance than baseline methods without intuitive prosodic features. Moreover, the proposed speaker adaptation method with utterance-level prosodic features achieves the best similarity of synthetic speech among all compared methods.
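The concatenation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`utterance_prosody`, `build_decoder_memory`), the exact feature definitions (mean F0 over voiced frames, max minus min F0, phones per second, mean frame energy), the convention that an F0 of 0 marks an unvoiced frame, and the toy dimensions are all assumptions made here for illustration. The abstract only states which four features are used and that they are concatenated with the text encoder outputs and speaker vectors.

```python
from statistics import mean


def utterance_prosody(f0_hz, frame_energy, num_phones, duration_s):
    """Hypothetical utterance-level features:
    [mean pitch, pitch range, speech rate, mean energy]."""
    # Assumption: an F0 value of 0 marks an unvoiced frame.
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames in utterance")
    return [
        mean(voiced),               # pitch (Hz)
        max(voiced) - min(voiced),  # pitch range (Hz)
        num_phones / duration_s,    # speech rate (phones per second)
        mean(frame_energy),         # energy (arbitrary units)
    ]


def build_decoder_memory(encoder_outputs, speaker_vector, prosody_vector):
    """Append the same speaker and prosody vectors to every encoder time step."""
    return [
        list(step) + list(speaker_vector) + list(prosody_vector)
        for step in encoder_outputs
    ]


# Toy example: 3 encoder steps of size 2, a speaker vector of size 3.
prosody = utterance_prosody(
    f0_hz=[0.0, 100.0, 120.0, 0.0, 140.0],
    frame_energy=[0.5, 1.0, 1.5, 0.5, 1.5],
    num_phones=4,
    duration_s=2.0,
)
memory = build_decoder_memory(
    encoder_outputs=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    speaker_vector=[1.0, -1.0, 0.5],
    prosody_vector=prosody,
)
```

In this sketch each decoder input step has size 2 + 3 + 4 = 9. For the speaker-level variant mentioned in the abstract, one plausible reading is that the same kind of feature vector would be computed once per speaker and reused for all of that speaker's utterances, but the abstract does not spell this out.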
