Paper Title

The Notary in the Haystack -- Countering Class Imbalance in Document Processing with CNNs

Paper Authors

Martin Leipert, Georg Vogeler, Mathias Seuret, Andreas Maier, Vincent Christlein

Abstract

Notarial instruments are a category of documents. A notarial instrument can be distinguished from other documents by its notary sign, a prominent symbol in the certificate, which also makes it possible to identify the document's issuer. Naturally, notarial instruments are underrepresented relative to other documents. This makes classification difficult because class imbalance in the training data worsens the performance of Convolutional Neural Networks. In this work, we evaluate different countermeasures for this problem. They are applied to a binary classification task and a segmentation task on a collection of medieval documents. In classification, notarial instruments are distinguished from other documents, while in segmentation the notary sign is separated from the certificate. We evaluate different techniques, such as data augmentation, under- and oversampling, as well as regularizing with focal loss. The combination of random minority oversampling and data augmentation leads to the best performance. In segmentation, we evaluate three loss functions and their combinations, of which only class-weighted dice loss was able to segment the notary sign sufficiently.
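
The three countermeasures named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the binary form of focal loss with gamma = 2, the flattened per-class pixel lists, and this particular weighted-dice formulation are all assumptions chosen for illustration. The abstract does not specify any of them.

```python
import math
import random


def random_minority_oversample(labels, seed=0):
    """Return sample indices in which every class appears as often as the
    largest one, by redrawing smaller classes with replacement.
    Hypothetical helper; assumes `labels` is non-empty."""
    rng = random.Random(seed)
    groups = {}
    for i, y in enumerate(labels):
        groups.setdefault(y, []).append(i)
    target = max(len(ix) for ix in groups.values())
    balanced = []
    for ix in groups.values():
        balanced.extend(ix)                                   # keep originals
        balanced.extend(rng.choices(ix, k=target - len(ix)))  # add duplicates
    rng.shuffle(balanced)
    return balanced


def binary_focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; gamma = 0 reduces to plain cross-entropy.
    `probs` are predicted probabilities of class 1."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)       # avoid log(0)
        p_t = p if t == 1 else 1.0 - p        # probability of the true class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def class_weighted_dice_loss(probs, onehot, class_weights, eps=1e-7):
    """One common class-weighted dice formulation (an assumption here).
    probs[c] and onehot[c] are flattened pixel lists for class c."""
    weighted = 0.0
    for p_c, y_c, w_c in zip(probs, onehot, class_weights):
        inter = sum(p * y for p, y in zip(p_c, y_c))
        denom = sum(p_c) + sum(y_c)
        weighted += w_c * (2.0 * inter + eps) / (denom + eps)
    return 1.0 - weighted / sum(class_weights)
```

With 95 majority and 5 minority labels, `random_minority_oversample` returns 190 indices, 95 per class. In the dice sketch, giving the rare class (for example, the notary sign pixels) a larger weight means a prediction that ignores it is penalized heavily even though it covers only a few pixels.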
