Paper title

mSLAM: Massively multilingual joint pre-training for speech and text

Authors

Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, Alexis Conneau

Abstract

We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training. Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations. mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process. Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research.
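The abstract names a Connectionist Temporal Classification (CTC) loss on paired speech and transcript data as one of the training signals. As a rough illustration of what such a loss computes, the following is a minimal sketch of the textbook CTC forward recursion in plain Python. It is not the authors' implementation: the function name `ctc_neg_log_likelihood`, the list-of-lists input layout, and the choice of index 0 as the blank symbol are all assumptions made for this example.

```python
import math

NEG_INF = float("-inf")


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (hypothetical helper).

    log_probs: T frames, each a list of log-probabilities over the vocabulary.
    target:    label ids without blanks (e.g. character ids of a transcript).
    blank:     index of the CTC blank symbol (assumed to be 0 here).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states, num_frames = len(ext), len(log_probs)

    # alpha[s] = log-probability of all partial alignments ending in state s
    alpha = [NEG_INF] * num_states
    alpha[0] = log_probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [NEG_INF] * num_states
        for s in range(num_states):
            candidates = [alpha[s]]              # stay in the same state
            if s >= 1:
                candidates.append(alpha[s - 1])  # advance by one state
            # Skip over a blank only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            total = _logsumexp(candidates)
            if total != NEG_INF:
                new_alpha[s] = total + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end in the last label or the trailing blank
    final = _logsumexp(alpha[-2:]) if num_states > 1 else alpha[-1]
    return -final


if __name__ == "__main__":
    # Two frames, vocabulary {0: blank, 1: "a"}, uniform probabilities.
    frames = [[math.log(0.5), math.log(0.5)]] * 2
    print(ctc_neg_log_likelihood(frames, [1]))
```

With two uniform frames and a single target label, three alignments ("a" then blank, blank then "a", and "a" repeated) each have probability 0.25, so the loss is -ln(0.75), roughly 0.288. In practice a framework implementation would be used instead of this loop; the sketch only shows how the loss marginalizes over all frame-level alignments of a transcript.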
