Paper Title

WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization

Authors

Faisal Ladhak, Esin Durmus, Claire Cardie, Kathleen McKeown

Abstract

We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct cross-lingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost-efficient during inference.
