Automatic Recognition and Classification of Future Work Sentences from Academic Articles in a Specific Domain

Title

Automatic Recognition and Classification of Future Work Sentences from Academic Articles in a Specific Domain

Authors

Zhang, Chengzhi; Xiang, Yi; Hao, Wenke; Li, Zhicheng; Qian, Yuchen; Wang, Yuzhuo

Abstract

Future work sentences (FWS) are the sentences in academic papers in which authors describe their proposed follow-up research directions. This paper presents methods to automatically extract FWS from academic papers and classify them according to the future directions they express. FWS recognition methods enable subsequent researchers to locate future work sentences more accurately and quickly, and reduce the time and cost of building a corpus. Existing work on the automatic identification of future work sentences is scarce and cannot accurately identify FWS in academic papers, which prevents data mining on a large scale. Furthermore, future work covers many aspects, and subdividing its content helps in analyzing specific development directions. In this paper, Natural Language Processing (NLP) is used as a case study: FWS are extracted from academic papers and classified into different types. We manually build an annotated corpus with six different types of FWS. Automatic recognition and classification of FWS are then implemented using machine learning models, and the performance of these models is compared on the evaluation metrics. The results show that the Bernoulli Bayesian model performs best in the automatic recognition task, with a macro F1 of 90.73%, and the SciBERT model performs best in the automatic classification task, with a weighted average F1 of 72.63%. Finally, we extract keywords from FWS to gain a deeper understanding of the key content they describe, and, by measuring the similarity between future work sentences and abstracts, we demonstrate that the content identified in FWS is reflected in subsequent research work.
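The abstract reports two different averages of F1: macro F1 for the recognition task and weighted average F1 for the classification task. The sketch below shows how the two are commonly computed and why they can differ when classes are imbalanced. It is an illustration only, not the authors' evaluation code; the function names, the labels "fws" and "other", and the toy predictions are all hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1, weighted by each class's count in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[lab] * support[lab] for lab in scores) / len(y_true)


# Hypothetical toy data: 2 future work sentences, 4 other sentences.
y_true = ["fws", "fws", "other", "other", "other", "other"]
y_pred = ["fws", "other", "other", "other", "other", "fws"]

print(per_class_f1(y_true, y_pred))  # {'fws': 0.5, 'other': 0.75}
print(macro_f1(y_true, y_pred))      # 0.625
print(weighted_f1(y_true, y_pred))   # about 0.667
```

In this toy case the larger "other" class pulls the weighted average above the macro average, so the two reported scores are not directly comparable with each other.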
