Paper Title
Conditional Spoken Digit Generation with StyleGAN
Paper Authors
Paper Abstract
This paper adapts a StyleGAN model for speech generation with minimal or no conditioning on text. StyleGAN is a multi-scale convolutional GAN capable of hierarchically capturing data structure and latent variation on multiple spatial (or temporal) levels. The model has previously achieved impressive results on facial image generation, and it is appealing for audio applications because similar multi-level structures are present in the data. In this paper, we train a StyleGAN to generate mel-frequency spectrograms on the Speech Commands dataset, which contains spoken digits uttered by multiple speakers in varying acoustic conditions. In the conditional setting, our model is conditioned on the digit identity, while learning the remaining data variation remains an unsupervised task. We compare our model to the current unsupervised state-of-the-art speech synthesis GAN architecture, WaveGAN, and show that the proposed model outperforms it according to numerical measures and subjective evaluation in listening tests.
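
As a rough illustration of the conditioning idea described in the abstract, the sketch below shows one way a StyleGAN-style mapping network could combine a supervised digit label with an unsupervised latent code. This is a minimal hypothetical example in PyTorch, not the authors' implementation; the class name, layer sizes, and embedding dimension are assumptions chosen for illustration only.

    # Minimal sketch (not the authors' code): conditioning a StyleGAN-style
    # mapping network on digit identity. All names and dimensions are assumed.
    import torch
    import torch.nn as nn

    class ConditionalMappingNetwork(nn.Module):
        """Maps (latent z, digit label) to an intermediate style vector w.

        The label embedding carries the supervised digit identity, while the
        random latent z is left to model the remaining (unsupervised) variation,
        e.g. speaker and acoustic conditions.
        """

        def __init__(self, latent_dim=512, num_classes=10, embed_dim=64,
                     w_dim=512, depth=4):
            super().__init__()
            self.label_embed = nn.Embedding(num_classes, embed_dim)  # digits 0-9
            layers = []
            in_dim = latent_dim + embed_dim
            for _ in range(depth):
                layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
                in_dim = w_dim
            self.net = nn.Sequential(*layers)

        def forward(self, z, digit):
            # Concatenate the unsupervised latent with the digit embedding,
            # then map to the style space consumed by the synthesis network.
            cond = self.label_embed(digit)
            return self.net(torch.cat([z, cond], dim=1))

    # Example: draw a style vector conditioned on the digit "seven".
    mapping = ConditionalMappingNetwork()
    z = torch.randn(1, 512)        # unsupervised latent
    digit = torch.tensor([7])      # supervised condition
    w = mapping(z, digit)          # style vector fed to the generator

In such a setup, only the digit label is supervised; everything else the generator learns about the spectrograms (speaker characteristics, recording conditions) must be absorbed by the latent code, which matches the unsupervised aspect emphasized in the abstract.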