Paper Title
Entity Matching by Pool-based Active Learning
Paper Authors
Abstract
The goal of entity matching is to find records from different data sources that refer to the same real-world entity. Among current mainstream methods, rule-based entity matching requires extensive domain knowledge, while machine-learning-based and deep-learning-based methods need a large number of labeled samples to build a model, which is difficult to obtain in some applications. In addition, learning-based methods are prone to overfitting, so they place high demands on the quality of the training samples. In this paper, we present ALMatcher, an active learning method for entity matching tasks. The method requires manual labeling of only a small number of valuable samples and uses these samples to build a high-quality model. We propose a hybrid uncertainty measure as the query strategy for finding the samples most worth labeling, which minimizes the number of labeled training samples while still meeting the task requirements. The proposed method has been validated on seven data sets from different domains. The experiments show that ALMatcher uses only a small number of labeled samples and achieves better results than existing approaches.
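The pool-based loop described in the abstract can be illustrated with a minimal sketch. The abstract does not define the hybrid uncertainty measure, so the blend of binary entropy and margin uncertainty below is an assumption made purely for illustration and is not ALMatcher's actual query strategy. The names `hybrid_uncertainty`, `select_queries`, `active_learning`, the weight `alpha`, and the `fit` and `oracle` callables are likewise hypothetical.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of a match probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def margin_uncertainty(p):
    """1.0 when p = 0.5 (model undecided), 0.0 when p is 0 or 1."""
    return 1.0 - abs(2.0 * p - 1.0)


def hybrid_uncertainty(p, alpha=0.5):
    # Assumption: a weighted blend of two standard uncertainty measures.
    # The paper's own hybrid measure may be defined differently.
    return alpha * entropy(p) + (1.0 - alpha) * margin_uncertainty(p)


def select_queries(pool_probs, batch_size, alpha=0.5):
    """Return the ids of the batch_size pool pairs with the highest score."""
    ranked = sorted(
        pool_probs,
        key=lambda pid: hybrid_uncertainty(pool_probs[pid], alpha),
        reverse=True,
    )
    return ranked[:batch_size]


def active_learning(pool, oracle, fit, rounds, batch_size):
    """Pool-based loop.

    pool:   dict mapping pair id -> features of a candidate record pair
    oracle: function pair id -> label (stand-in for the human annotator)
    fit:    function labeled dict -> predictor (features -> match probability)
    """
    pool = dict(pool)   # work on a copy; queried pairs leave the pool
    labeled = {}
    for _ in range(rounds):
        predict = fit(labeled)
        probs = {pid: predict(x) for pid, x in pool.items()}
        for pid in select_queries(probs, batch_size):
            labeled[pid] = (pool.pop(pid), oracle(pid))
    return fit(labeled), labeled


# Hypothetical match probabilities for four candidate record pairs.
pool_probs = {"pair_a": 0.97, "pair_b": 0.52, "pair_c": 0.08, "pair_d": 0.40}
print(select_queries(pool_probs, batch_size=2))  # ['pair_b', 'pair_d']
```

In this sketch the pairs whose predicted match probability lies closest to 0.5 are sent to the annotator first, which is the general idea behind uncertainty-based querying: labeling effort is spent where the current model is least sure.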