Paper Title
EACELEB: An East Asian Language Speaking Celebrity Dataset for Speaker Recognition
Authors
Abstract
Large datasets are valuable for training speaker recognition systems, and research groups have built several over the years. VoxCeleb is one such large speaker recognition dataset, extracted from YouTube videos. This paper presents an audio-visual method for acquiring audio data from YouTube given a speaker's name as input. The system follows a pipeline similar to the VoxCeleb data acquisition method. However, our work focuses on fast data acquisition: once a face has been detected, we track it through subsequent frames rather than running face detection on every frame, which is computationally more expensive. We show that applying audio diarization to the acquired data can yield equal error rates comparable to VoxCeleb. A second set of experiments shows that fine-tuning a pre-trained x-vector system on the acquired data lowers the error rate further. Like VoxCeleb, this work focuses primarily on collecting audio of celebrities; unlike VoxCeleb, our target audio comes from celebrities in East Asian countries. Finally, we set up a speaker verification task to evaluate the accuracy of the acquired data. After diarization and fine-tuning, we achieve an equal error rate of approximately 4% across the entire dataset.
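The equal error rate (EER) reported above is the operating point at which the false acceptance rate equals the false rejection rate in a speaker verification task. The following is a minimal sketch of how that metric can be computed from trial scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that higher scores mean "same speaker", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) for verification scores.

    Assumption: a higher score means the two utterances are more likely
    to come from the same speaker. target_scores are same-speaker trials,
    nontarget_scores are different-speaker trials.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    best = None  # (gap, eer_estimate, threshold)
    # Every observed score is a candidate decision threshold.
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is >= t.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest;
        # report their mean as the EER estimate.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy example with hypothetical scores (not data from the paper).
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With these toy scores, one of four same-speaker trials is rejected and one of four different-speaker trials is accepted at threshold 0.6, so both error rates are 25%. Real evaluations typically interpolate between thresholds instead of picking the nearest observed score, which matters when the number of trials is small.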