Paper Title
SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate
Paper Authors
Paper Abstract
The mapping of text to speech (TTS) is non-deterministic: letters may be pronounced differently based on context, and phonemes can vary with physiological and stylistic factors such as gender, age, accent, and emotion. Neural speaker embeddings, trained to identify or verify speakers, are typically used to represent and transfer such characteristics from reference speech to synthesized speech. Speech separation, on the other hand, is the challenging task of separating individual speakers from an overlapping mixture of multiple speakers. Speaker attractors are high-dimensional embedding vectors that pull the time-frequency bins of each speaker's speech towards themselves while repelling those belonging to other speakers. In this work, we explore the possibility of using these powerful speaker attractors for zero-shot speaker adaptation in multi-speaker TTS synthesis and propose Speaker Attractor Text to Speech (SATTS). Through various experiments, we show that SATTS can synthesize natural speech from text given an unseen target speaker's reference signal, even under less-than-ideal recording conditions, e.g. reverberation or mixing with other speakers.
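The attractor mechanism described above (pulling each time-frequency bin toward its speaker's attractor while repelling the others) can be illustrated with a minimal numpy sketch in the style of deep attractor networks. This is not the SATTS implementation; the shapes, the toy embeddings, and the ideal binary assignment `Y` are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative shapes, not real audio):
# n_bins time-frequency bins, each mapped to a D-dim embedding
# by some separation network.
n_bins, emb_dim, n_spk = 6, 4, 2
V = rng.normal(size=(n_bins, emb_dim))   # bin embeddings (hypothetical)

# Ideal binary assignment of bins to speakers (training oracle):
# first 3 bins belong to speaker 0, the rest to speaker 1.
Y = np.zeros((n_bins, n_spk))
Y[:3, 0] = 1.0
Y[3:, 1] = 1.0

# Attractors: centroid of the embeddings of each speaker's bins.
attractors = (Y.T @ V) / Y.sum(axis=0, keepdims=True).T  # (n_spk, emb_dim)

# Soft masks: a softmax over bin-attractor similarities pulls each
# bin toward the closest attractor and repels it from the others.
logits = V @ attractors.T                                # (n_bins, n_spk)
masks = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
```

Each row of `masks` sums to one, so every time-frequency bin is softly assigned across speakers; applying a column of `masks` to the mixture spectrogram would recover that speaker's speech.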