Paper Title


DeepSinger: Singing Voice Synthesis with Data Mined From the Web

Paper Authors

Yi Ren, Xu Tan, Tao Qin, Jian Luan, Zhou Zhao, Tie-Yan Liu

Abstract


In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model to automatically extract the duration of each phoneme in the lyrics, proceeding from the coarse-grained sentence level to the fine-grained phoneme level, and further design a multi-lingual multi-singer singing model based on a feed-forward Transformer to directly generate linear-spectrograms from lyrics, synthesizing voices with Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites; 2) the lyrics-to-singing alignment model avoids any human effort for alignment labeling and greatly reduces labeling cost; 3) the singing model based on a feed-forward Transformer is simple and efficient, removing the complicated acoustic feature modeling of parametric synthesis and leveraging a reference encoder to capture a singer's timbre from noisy singing data; and 4) it can synthesize singing voices in multiple languages and for multiple singers. We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese, and English). The results demonstrate that, with singing data purely mined from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness. (Footnote: Our audio samples are available at https://speechresearch.github.io/deepsinger/.)
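The abstract notes that DeepSinger generates linear-spectrograms and then synthesizes waveforms with Griffin-Lim, which iteratively estimates the phase that a magnitude spectrogram discards. The paper does not give implementation details; the sketch below is a minimal, hedged illustration of the generic Griffin-Lim algorithm using SciPy's STFT routines (the function name `griffin_lim` and all parameter values are illustrative, not from the paper):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=1024, hop=256, n_iter=32, fs=22050):
    """Estimate a waveform from a magnitude spectrogram by iterative
    phase refinement (Griffin-Lim). `mag` has shape (n_fft//2+1, frames)."""
    rng = np.random.default_rng(0)
    # Start from random phase.
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        # Inverse STFT of the current complex-spectrogram estimate.
        _, x = istft(mag * angles, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
        # Re-analyze the signal and keep only its phase.
        _, _, spec = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
        # Pad/crop frames so the phase matches the target magnitude's shape.
        t = mag.shape[1]
        if spec.shape[1] < t:
            spec = np.pad(spec, ((0, 0), (0, t - spec.shape[1])))
        angles = np.exp(1j * np.angle(spec[:, :t]))
    _, x = istft(mag * angles, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return x
```

In practice, SVS and TTS systems often replace Griffin-Lim with a neural vocoder for higher fidelity; DeepSinger's choice of Griffin-Lim keeps the synthesis step simple and training-free.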
