Paper title
Landmarks and Regions: A Robust Approach to Data Extraction
Paper authors
Abstract
We propose a new approach to extracting data items or field values from semi-structured documents. Examples of such problems include extracting the passenger name, departure time, and departure airport from a travel itinerary, or extracting the price of an item from a purchase receipt. Traditional approaches to data extraction use machine learning or program synthesis to process the whole document to extract the desired fields. Such approaches are not robust to format changes in the document: the extraction process typically fails even when the changes affect only parts of the document that are unrelated to the fields of interest. We propose a new approach to data extraction based on the concepts of landmarks and regions. Humans routinely use landmarks when processing documents manually, to zoom in and focus their attention on small regions of interest in the document. Inspired by this human intuition, we use the notion of landmarks in program synthesis to automatically synthesize extraction programs that first extract a small region of interest, and then extract the desired value from that region in a subsequent step. We have implemented our landmark-based extraction approach in a tool, LRSyn, and present an extensive evaluation on HTML documents as well as scanned images of invoices and receipts. Our results show that our approach is robust to various types of format changes that routinely happen in real-world settings.
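
The two-step idea described in the abstract (locate a landmark, cut out a small region around it, then extract the value from that region only) can be illustrated with a small hand-written sketch. This is not LRSyn and nothing here is synthesized: the function names, the "Departure" landmark, the one-line region size, the regular expression, and the sample itineraries are all assumptions made for illustration. The sketch only contrasts a whole-document extractor with a landmark-and-region extractor when unrelated text is added to the document.

```python
import re

# Hypothetical pattern for a clock time such as "8:15 AM" (an assumption, not from the paper).
TIME_PATTERN = r"\b(\d{1,2}:\d{2}\s?(?:AM|PM))"


def first_time_in_document(document):
    """Whole-document baseline: return the first time-like string anywhere."""
    match = re.search(TIME_PATTERN, document)
    return match.group(1) if match else None


def extract_region(lines, landmark, lines_after=1):
    """Step 1: return the landmark line plus a few following lines, or None."""
    for i, line in enumerate(lines):
        if landmark in line:
            return lines[i : i + 1 + lines_after]
    return None


def extract_value(region, pattern):
    """Step 2: search only inside the region and return the first captured group."""
    for line in region:
        match = re.search(pattern, line)
        if match:
            return match.group(1)
    return None


def extract_departure_time(document):
    """Landmark-and-region extraction of the departure time."""
    region = extract_region(document.splitlines(), "Departure", lines_after=1)
    if region is None:
        return None
    return extract_value(region, TIME_PATTERN)


# Made-up sample documents. V2 adds unrelated lines at the top.
ITINERARY_V1 = """\
Passenger: Jane Doe
Departure: SEA
8:15 AM, Gate B4
Arrival: SFO
10:30 AM
"""

ITINERARY_V2 = """\
*** Special offer: book before 9:00 AM and save! ***
Booking reference: X1Y2Z3
Passenger: Jane Doe
Departure: SEA
8:15 AM, Gate B4
Arrival: SFO
10:30 AM
"""

if __name__ == "__main__":
    for name, doc in [("v1", ITINERARY_V1), ("v2", ITINERARY_V2)]:
        print(name, first_time_in_document(doc), extract_departure_time(doc))
```

In this toy setting, the whole-document baseline picks up the time from the unrelated banner in the second itinerary, while the landmark-based extractor still returns the departure time, because it never looks outside the region anchored at the "Departure" landmark.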