BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset

Paper Title

BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset

Authors

Ajwad Akil, Najrin Sultana, Abhik Bhattacharjee, Rifat Shahriyar

Abstract

In this work, we present BanglaParaphrase, a high-quality synthetic Bangla paraphrase dataset curated by a novel filtering pipeline. We aim to take a step towards alleviating the low-resource status of the Bangla language in the NLP domain through the introduction of BanglaParaphrase, which ensures quality by preserving both semantics and diversity, making it particularly useful for enhancing other Bangla datasets. We present a detailed comparative analysis of our dataset, and of models trained on it, against other existing works to establish the viability of our synthetic paraphrase data generation pipeline. We are making the dataset and models publicly available at https://github.com/csebuetnlp/banglaparaphrase to further the state of Bangla NLP.
