MuST-Cinema: a Speech-to-Subtitles corpus

Paper title

MuST-Cinema: a Speech-to-Subtitles corpus

Authors

Alina Karakanta, Matteo Negri, Marco Turchi

Abstract

Growing needs in localising audiovisual content in multiple languages through subtitles call for the development of automatic solutions for human subtitling. Neural Machine Translation (NMT) can contribute to the automatisation of subtitling, facilitating the work of human subtitlers and reducing turn-around times and related costs. NMT requires high-quality, large, task-specific training data. The existing subtitling corpora, however, are missing both alignments to the source language audio and important information about subtitle breaks. This poses a significant limitation for developing efficient automatic approaches for subtitling, since the length and form of a subtitle directly depend on the duration of the utterance. In this work, we present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles. The corpus consists of (audio, transcription, translation) triplets. Subtitle breaks are preserved by inserting special symbols. We show that the corpus can be used to build models that efficiently segment sentences into subtitles, and we propose a method for annotating existing subtitling corpora with subtitle breaks that conforms to the length constraint.
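To make the idea of break symbols and a length constraint concrete, the following is a minimal sketch of how an annotated sentence could be split into subtitles and checked against a per-line character limit. The marker names `<eol>` and `<eob>`, the 42-character limit, and the helper names `split_subtitles` and `conforms` are assumptions for illustration only; the abstract says only that "special symbols" are inserted and does not state a numeric limit.

```python
# Hypothetical break markers: the abstract only mentions "special symbols".
EOL = "<eol>"   # assumed marker: line break inside a subtitle block
EOB = "<eob>"   # assumed marker: end of a subtitle block
MAX_CHARS = 42  # assumed per-line limit, not given in the abstract


def split_subtitles(annotated: str) -> list[list[str]]:
    """Split an annotated sentence into blocks, each a list of lines."""
    blocks = []
    for block in annotated.split(EOB):
        # Strip whitespace around each line and drop empty fragments.
        lines = [line.strip() for line in block.split(EOL)]
        lines = [line for line in lines if line]
        if lines:
            blocks.append(lines)
    return blocks


def conforms(annotated: str, max_chars: int = MAX_CHARS) -> bool:
    """Return True if every subtitle line has at most max_chars characters."""
    return all(
        len(line) <= max_chars
        for block in split_subtitles(annotated)
        for line in block
    )


example = (
    "Growing needs in localising <eol> audiovisual content <eob> "
    "call for automatic solutions. <eob>"
)
print(split_subtitles(example))
print(conforms(example))
```

A check of this kind could serve as a simple filter when annotating an existing corpus with breaks, although the method actually proposed in the paper may differ.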
