An Unsupervised Method for OCR Correction and Spelling Normalisation

Paper title

An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Authors

Quan Duong, Mika Hämäläinen, Simon Hengchen

Abstract

Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.
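
The abstract does not say how the parallel data is extracted, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes that a noisy token can be paired with its closest word in a trusted lexicon by edit distance, and that each pair is then split into space-separated characters for a character-based sequence-to-sequence model. The names `extract_pairs`, `to_char_sequence` and `max_distance`, and the toy Finnish tokens, are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def extract_pairs(tokens, lexicon, max_distance=2):
    """Pair each out-of-lexicon token with its closest in-lexicon word.

    Assumption: edit distance against a lexicon stands in for whatever
    candidate search the actual method uses.
    """
    candidates = sorted(lexicon)  # sorted so that ties resolve deterministically
    pairs = []
    for token in sorted(set(tokens)):
        if token in lexicon:
            continue  # already a known word, nothing to correct
        best = min(candidates, key=lambda w: edit_distance(token, w))
        if edit_distance(token, best) <= max_distance:
            pairs.append((token, best))
    return pairs


def to_char_sequence(word: str) -> str:
    """Format a word as space-separated characters for a character-level model."""
    return " ".join(word)


# Toy data: "ta1o" and "kiria" play the role of OCR errors for "talo" and "kirja".
tokens = ["talo", "ta1o", "kirja", "kiria", "xyzzyq"]
lexicon = {"talo", "kirja"}

pairs = extract_pairs(tokens, lexicon)
training_lines = [(to_char_sequence(src), to_char_sequence(tgt)) for src, tgt in pairs]
```

The distance threshold keeps tokens that resemble no lexicon word (here `xyzzyq`) out of the training data. For a morphologically rich language such as Finnish, a plain word list would miss most inflected forms, which is the kind of problem the abstract says the adaptation has to address.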

Paper title