Paper Title

Punctuation restoration in Swedish through fine-tuned KB-BERT

Paper Author

Nilsson, John Björkman

Paper Abstract

Presented here is a method for automatic punctuation restoration in Swedish using a BERT model. The method is based on KB-BERT, a publicly available neural network language model pre-trained on a Swedish corpus by the National Library of Sweden. This model was then fine-tuned for the specific task using a corpus of government texts. Given a lower-cased and unpunctuated Swedish text as input, the model is expected to return a grammatically correct, punctuated copy of the text as output. A successful solution to this problem would benefit an array of NLP domains, such as speech-to-text and automated text. Only the punctuation marks period, comma, and question mark were considered for the project, owing to a lack of data for rarer marks such as the semicolon. Additionally, some marks are largely interchangeable with more common ones, such as exclamation points with periods; the data set therefore had all exclamation points replaced with periods. The fine-tuned Swedish BERT model, dubbed prestoBERT, achieved an overall F1-score of 78.9. The proposed model scored similarly to international counterparts, with Hungarian and Chinese models obtaining F1-scores of 82.2 and 75.6 respectively. As a further comparison, a human evaluation case study was carried out. The human test group achieved an overall F1-score of 81.7, but scored substantially worse than prestoBERT on both period and comma. Inspecting output sentences from the model and the humans shows satisfactory results, despite the difference in F1-score. The disconnect seems to stem from an unnecessary focus on replicating the exact punctuation used in the test set, rather than accepting any of a number of correct interpretations. If the loss function could be rewritten to reward all grammatically correct outputs, rather than only the one original example, performance could improve significantly for both prestoBERT and the human group.
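As a concrete illustration of the setup the abstract describes, the sketch below casts punctuation restoration as token classification on top of KB-BERT with the Hugging Face transformers library. This is a minimal sketch under stated assumptions, not the author's prestoBERT implementation: the checkpoint name KB/bert-base-swedish-cased, the four-way label set, and the restore_punctuation helper are illustrative choices, and the classification head is randomly initialized until it is fine-tuned on labelled data such as the government-text corpus mentioned above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Label predicted for each word: the punctuation mark (if any) to insert after it.
LABELS = ["O", "PERIOD", "COMMA", "QUESTION"]
MARK = {"O": "", "PERIOD": ".", "COMMA": ",", "QUESTION": "?"}

# Public KB-BERT checkpoint from the National Library of Sweden (KBLab);
# num_labels adds an untrained token-classification head on top of it.
tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "KB/bert-base-swedish-cased", num_labels=len(LABELS)
)
model.eval()

def restore_punctuation(text: str) -> str:
    """Insert a predicted punctuation mark (or none) after every word of a
    lower-cased, unpunctuated Swedish input string."""
    words = text.split()
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits[0]          # shape: (sequence_length, num_labels)
    preds = logits.argmax(dim=-1).tolist()

    # Map sub-token predictions back to words, keeping the last sub-token of each word.
    label_per_word = ["O"] * len(words)
    for position, word_id in enumerate(enc.word_ids(batch_index=0)):
        if word_id is not None:
            label_per_word[word_id] = LABELS[preds[position]]

    return " ".join(w + MARK[l] for w, l in zip(words, label_per_word))

# With an untrained head the output is arbitrary; it becomes meaningful only
# after fine-tuning on punctuation-labelled text.
print(restore_punctuation("har du läst boken den var mycket bra"))
```

Fine-tuning such a model would typically minimise a per-token cross-entropy loss against the single reference punctuation of each training sentence, which is exactly the limitation the abstract points to when it suggests rewarding all grammatically correct outputs instead of only the one original example.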
