Paper Title

Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models

Authors

Mengxi Wei, Yifan He, Qiong Zhang

Abstract

Many business documents processed in modern NLP and IR pipelines are visually rich: in addition to text, their semantics can also be captured by visual traits such as layout, format, and fonts. We study the problem of information extraction from visually rich documents (VRDs) and present a model that combines the power of large pre-trained language models and graph neural networks to efficiently encode both textual and visual information in business documents. We further introduce new fine-tuning objectives to improve in-domain unsupervised fine-tuning and better utilize large amounts of unlabeled in-domain data. We experiment on real-world invoice and resume datasets and show that the proposed method outperforms strong text-based RoBERTa baselines by 6.3% absolute F1 on invoices and 4.7% absolute F1 on resumes. When evaluated in a few-shot setting, our method requires up to 30x less annotation data than the baseline to achieve the same level of performance at ~90% F1.
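To make the combination of a pre-trained language model and a graph neural network concrete, here is a minimal, illustrative PyTorch sketch, not the authors' implementation: per-segment text embeddings (e.g., from RoBERTa) are concatenated with bounding-box features, refined by a small message-passing GNN over a layout adjacency graph, and classified into field labels per segment. All class names, hidden sizes, and the toy chain-shaped adjacency are hypothetical choices for the demo.

```python
# Illustrative sketch only (not the paper's code): fuse pre-trained LM text
# features with layout via a simple GNN, then tag each document segment.
import torch
import torch.nn as nn

class LayoutGNNLayer(nn.Module):
    """One round of message passing over the segment adjacency graph."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.Linear(2 * dim, dim)

    def forward(self, h, adj):
        # h: (N, dim) segment features; adj: (N, N) row-normalized adjacency.
        neighbors = adj @ self.msg(h)                 # aggregate neighbor messages
        return torch.relu(self.upd(torch.cat([h, neighbors], dim=-1)))

class VRDTagger(nn.Module):
    """Combine LM text embeddings with box features, refine with a GNN, tag fields."""
    def __init__(self, lm_dim=768, box_dim=4, hidden=256, num_labels=5):
        super().__init__()
        self.proj = nn.Linear(lm_dim + box_dim, hidden)
        self.gnn = nn.ModuleList([LayoutGNNLayer(hidden) for _ in range(2)])
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, text_emb, boxes, adj):
        h = torch.relu(self.proj(torch.cat([text_emb, boxes], dim=-1)))
        for layer in self.gnn:
            h = layer(h, adj)
        return self.classifier(h)                     # per-segment label logits

# Toy usage: 6 segments with 768-d embeddings (stand-ins for RoBERTa outputs)
# and normalized (x0, y0, x1, y1) boxes; adjacency links neighboring segments.
N = 6
text_emb = torch.randn(N, 768)
boxes = torch.rand(N, 4)
adj = torch.eye(N) + torch.diag(torch.ones(N - 1), 1) + torch.diag(torch.ones(N - 1), -1)
adj = adj / adj.sum(-1, keepdim=True)                 # row-normalize
logits = VRDTagger()(text_emb, boxes, adj)
print(logits.shape)                                   # torch.Size([6, 5])
```

In a real pipeline the adjacency would be built from spatial proximity of OCR bounding boxes rather than a fixed chain, and the text embeddings would come from the pre-trained LM's encoder outputs.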
