Paper title
Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension
Paper authors
Paper abstract
Although Vietnamese ranks 17th in the world by number of native speakers, there has been little research on Vietnamese machine reading comprehension (MRC), the task of understanding a text and answering questions about it. One reason is the lack of high-quality benchmark datasets for this task. In this work, we construct a dataset of 2,783 multiple-choice question-answer pairs based on 417 Vietnamese texts that are commonly used to teach reading comprehension to elementary school pupils. In addition, we propose a lexical-based MRC method that uses semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text. We compare the performance of the proposed model with several lexical-based and neural network-based baseline models. Our proposed method achieves an accuracy of 61.81%, which is 5.51% higher than the best baseline model. We also measure human performance on our dataset and find a large gap between machine and human performance, which indicates that significant progress can still be made on this task. The dataset is freely available on our website for research purposes.
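To make the idea of a lexical-based approach to multiple-choice MRC more concrete, the following is a minimal sketch of a generic sliding-window word-overlap scorer. It is not the method proposed in the paper: it uses no semantic similarity measure and no external knowledge source, and all function names (`tokenize`, `overlap_score`, `choose_answer`) are hypothetical. It also assumes simple regex tokenization, whereas Vietnamese text would normally need a dedicated word segmenter.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: regex tokenization stands in for a real Vietnamese
    word segmenter, which this sketch does not include.
    """
    return re.findall(r"\w+", text.lower())


def overlap_score(passage_tokens, query_tokens, window_size):
    """Return the best bag-of-words overlap between the query and any
    window of `window_size` consecutive passage tokens."""
    query_counts = Counter(query_tokens)
    best = 0
    last_start = max(1, len(passage_tokens) - window_size + 1)
    for start in range(last_start):
        window = Counter(passage_tokens[start:start + window_size])
        # Multiset intersection: each shared token counts min(count) times.
        score = sum((window & query_counts).values())
        best = max(best, score)
    return best


def choose_answer(passage, question, options):
    """Return the index of the option whose question+option tokens
    overlap most with some window of the passage."""
    passage_tokens = tokenize(passage)
    question_tokens = tokenize(question)
    scores = []
    for option in options:
        query = question_tokens + tokenize(option)
        scores.append(overlap_score(passage_tokens, query, len(query)))
    # Ties are resolved in favor of the earliest option.
    return max(range(len(options)), key=scores.__getitem__)


if __name__ == "__main__":
    passage = ("Lan went to the market with her mother. "
               "They bought fresh fish and vegetables.")
    question = "What did Lan and her mother buy?"
    options = ["Fresh fish and vegetables", "A new bicycle",
               "Books and pencils", "Nothing"]
    print(choose_answer(passage, question, options))
```

A scorer of this kind only rewards surface word matches, which is one plausible reason the abstract pairs lexical matching with semantic similarity measures and external knowledge.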