Zero resource speech synthesis using transcripts derived from perceptual acoustic units

Paper title

Zero resource speech synthesis using transcripts derived from perceptual acoustic units

Authors

S, Karthik Pandia D; Murthy, Hema A

Abstract

Zerospeech synthesis is the task of building vocabulary-independent speech synthesis systems when transcriptions are not available for the training data. It is therefore necessary to convert the training data into a sequence of fundamental acoustic units (AUs) that can be used for synthesis at test time. This paper attempts to discover and model perceptual acoustic units consisting of steady-state and transient regions in speech. The transients roughly correspond to CV and VC units, while the steady states correspond to sonorants and fricatives. The speech signal is first preprocessed by segmenting it into CVC-like units using a short-term energy-like contour. These CVC segments are clustered using a connected-components-based graph clustering technique. The clustered CVC segments are initialized such that the onsets (CV) and decays (VC) correspond to transients, and the rhymes correspond to steady states. Following this initialization, the units are allowed to reorganize on the continuous speech into a final set of AUs in an HMM-GMM framework. The AU sequences thus obtained are used to train synthesis models. The performance of the proposed approach is evaluated on the Zerospeech 2019 challenge database. Subjective and objective scores show that reasonably good quality synthesis with low bit rate encoding can be achieved using the proposed AUs.
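
The first two stages described in the abstract (energy-based segmentation, then connected-components clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the exact energy-like contour, the distance between segments, or any thresholds, so the function names, the frame sizes, the local-minimum boundary rule, and the precomputed `distances` matrix below are all assumptions made for illustration.

```python
import numpy as np


def short_term_energy(signal, frame_len=400, hop=160):
    """Frame-wise energy: sum of squared samples in each frame.

    frame_len and hop are assumed values, not taken from the paper.
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])


def segment_at_energy_minima(energy):
    """Return frame indices of local energy minima as candidate boundaries.

    Assumption: syllable-like (CVC-like) units are delimited by energy dips.
    The first and last frames are always included as boundaries.
    """
    interior = [
        i for i in range(1, len(energy) - 1)
        if energy[i] < energy[i - 1] and energy[i] <= energy[i + 1]
    ]
    return [0] + interior + [len(energy) - 1]


def connected_components(distances, threshold):
    """Cluster items by linking every pair closer than `threshold`.

    `distances` is a symmetric matrix of pairwise segment distances
    (how it is computed is left open here). Each connected component
    of the resulting graph becomes one cluster; union-find tracks them.
    """
    n = len(distances)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if distances[i][j] < threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    # Relabel roots as consecutive cluster ids in order of first appearance.
    labels = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [labels[r] for r in roots]
```

In this sketch, the cluster labels would serve only as an initialization; the abstract states that the units are subsequently re-estimated on continuous speech in an HMM-GMM framework, which is not shown here.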
