Paper Title


Label-Efficient Self-Training for Attribute Extraction from Semi-Structured Web Documents

Paper Authors

Ritesh Sarkhel, Binxuan Huang, Colin Lockard, Prashant Shiralkar

Paper Abstract


Extracting structured information from HTML documents is a long-studied problem with a broad range of applications, including knowledge base construction, faceted search, and personalized recommendation. Prior work relies on a few human-labeled web pages from each target website, or thousands of human-labeled web pages from some seed websites, to train a transferable extraction model that generalizes to unseen target websites. Noisy content, low site-level consistency, and a lack of inter-annotator agreement make labeling web pages a time-consuming and expensive ordeal. We develop LEAST -- a Label-Efficient Self-Training method for Semi-Structured Web Documents -- to overcome these limitations. LEAST utilizes a few human-labeled pages to pseudo-annotate a large number of unlabeled web pages from the target vertical. It trains a transferable web-extraction model on both human-labeled and pseudo-labeled samples using self-training. To mitigate error propagation due to noisy training samples, LEAST re-weights each training sample based on its estimated label accuracy and incorporates it in training. To the best of our knowledge, this is the first work to propose end-to-end training of a transferable web extraction model using only a few human-labeled pages. Experiments on a large-scale public dataset show that, trained with fewer than ten human-labeled pages from each seed website, a LEAST-trained model outperforms the previous state of the art by more than 26 average F1 points on unseen websites, reducing the number of human-labeled pages needed to achieve comparable performance by more than 10x.
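The abstract outlines LEAST's core loop: pseudo-annotate unlabeled pages with a model trained on a few human-labeled pages, then self-train on both sets while re-weighting each sample by its estimated label accuracy. The paper's code is not reproduced here; the PyTorch sketch below illustrates that general pattern under stated assumptions. The function name `self_train`, the data loaders, and the use of softmax confidence as a stand-in for LEAST's label-accuracy estimate are all illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def self_train(model, optimizer, labeled_loader, unlabeled_loader,
               rounds=3, threshold=0.9):
    """Minimal self-training loop with confidence-weighted pseudo-labels.

    Assumes `labeled_loader` yields (features, labels), `unlabeled_loader`
    yields features only, and `model(features)` returns per-sample class
    logits. This is a generic sketch, not LEAST's actual code.
    """
    for _ in range(rounds):
        # Step 1: pseudo-annotate unlabeled pages with the current model.
        pseudo = []
        model.eval()
        with torch.no_grad():
            for feats in unlabeled_loader:
                probs = F.softmax(model(feats), dim=-1)
                conf, labels = probs.max(dim=-1)
                keep = conf > threshold  # discard low-confidence predictions
                if keep.any():
                    pseudo.append((feats[keep], labels[keep], conf[keep]))

        # Step 2: train on human-labeled samples at full weight...
        model.train()
        for feats, labels in labeled_loader:
            loss = F.cross_entropy(model(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # ...and on pseudo-labeled samples, each down-weighted by its
        # estimated label accuracy (approximated here by model confidence).
        for feats, labels, weights in pseudo:
            per_sample = F.cross_entropy(model(feats), labels,
                                         reduction="none")
            loss = (weights * per_sample).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

In LEAST itself, the per-sample weight comes from an explicit estimate of each pseudo-label's accuracy rather than raw softmax confidence, but the overall structure is the same: pseudo-annotate, re-weight, retrain.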
