Paper Title
Vocoder-free End-to-End Voice Conversion with Transformer Network
Authors
Abstract
Mel-frequency filter bank (MFB) based approaches have an advantage over the raw spectrum for learning speech because MFB features are smaller. However, speech generators built on MFB require an additional vocoder, which adds a large computational cost to training. Pre- and post-processing steps such as MFB extraction and a vocoder are not essential for converting one person's real speech into another voice. The raw spectrum, together with the phase, is enough to generate voices in different styles with clear pronunciation. We therefore propose a fast and effective approach that converts realistic voices from the raw spectrum in a parallel manner. Our transformer-based architecture, which contains no CNN or RNN layers, learns quickly and removes the sequential-computation limitation of conventional RNNs. In this paper, we introduce a vocoder-free end-to-end voice conversion method using a transformer network. The presented conversion model can also be used for speaker adaptation in speech recognition. Our approach converts a source voice to a target voice without using MFB or a vocoder. We can obtain an adapted MFB for speech recognition by multiplying the converted magnitude with the phase. We run our voice conversion experiments on the TIDIGITS dataset and evaluate naturalness, similarity, and clarity with mean opinion scores.
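The idea of recombining a converted magnitude with the phase, without a vocoder, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the frame length, hop size, and Hann window are arbitrary choices, the phase is taken from the source utterance, and `placeholder_convert` is a hypothetical stand-in for the transformer (it simply returns the magnitude unchanged).

```python
import numpy as np

FRAME_LEN = 256  # assumed analysis window length in samples
HOP = 64         # assumed hop size in samples


def stft(signal, frame_len=FRAME_LEN, hop=HOP):
    """Window each frame with a Hann window and take its real FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spectrum, frame_len=FRAME_LEN, hop=HOP):
    """Weighted overlap-add inverse of stft()."""
    window = np.hanning(frame_len)
    n_frames = spectrum.shape[0]
    out = np.zeros(frame_len + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    frames = np.fft.irfft(spectrum, n=frame_len, axis=1)
    for i in range(n_frames):
        start = i * hop
        out[start : start + frame_len] += frames[i] * window
        norm[start : start + frame_len] += window ** 2
    # Guard against division by zero where the window is zero (signal edges).
    return out / np.maximum(norm, 1e-8)


def placeholder_convert(magnitude):
    """Hypothetical stand-in for the conversion model: identity mapping."""
    return magnitude


# Toy source signal: a 440 Hz tone at an assumed 8 kHz sampling rate.
sample_rate = 8000
t = np.arange(4096) / sample_rate
source = np.sin(2 * np.pi * 440 * t)

# Split the raw spectrum into magnitude and phase.
spectrum = stft(source)
magnitude, phase = np.abs(spectrum), np.angle(spectrum)

# Convert the magnitude only, then recombine it with the (source) phase.
converted = placeholder_convert(magnitude)
waveform = istft(converted * np.exp(1j * phase))
```

Because the placeholder leaves the magnitude untouched, `waveform` reproduces `source` away from the signal edges, which shows that the magnitude and phase pair carries enough information to return to a waveform with a plain inverse transform. A real conversion model would replace `placeholder_convert` and change the magnitude before recombination.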