Title
Machine-learning classifiers for logographic name matching in public health applications: approaches for incorporating phonetic, visual, and keystroke similarity in large-scale probabilistic record linkage
Authors
Abstract
Approximate string-matching methods that account for complex variation in highly discriminatory text fields, such as personal names, can enhance probabilistic record linkage. However, discriminating between matching and non-matching strings is challenging for logographic scripts, where similarities in pronunciation, appearance, or keystroke sequence are not directly encoded in the string data. We leverage a large Chinese administrative dataset with known match status to develop logistic regression and XGBoost classifiers that integrate measures of visual, phonetic, and keystroke similarity to improve identification of potentially matching name pairs. We evaluate three methods of using name similarity scores in large-scale probabilistic record linkage, each of which can adapt to varying match prevalence and information in supporting fields: (1) setting a threshold score based on predicted quality of name-matching across all record pairs; (2) setting a threshold score based on predicted discriminatory power of the linkage model; and (3) using empirical score distributions among matches and non-matches to perform Bayesian adjustment of matching probabilities estimated from exact-agreement linkage. In experiments on holdout data, as well as on data simulated with varying name error rates and supporting fields, a logistic regression classifier incorporated via the Bayesian method demonstrated marked improvements over exact-agreement linkage in discriminatory power, match probability estimation, and accuracy, reducing the total number of misclassified record pairs by 21% in test data and by up to an average of 93% in simulated datasets. Our results demonstrate the value of incorporating visual, phonetic, and keystroke similarity for logographic name matching, as well as the promise of our Bayesian approach to leverage name-matching within large-scale record linkage.
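The third method in the abstract, Bayesian adjustment of an exact-agreement match probability using empirical score distributions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bayes_adjust`, the histogram-based likelihood estimate, the add-one smoothing, and the assumption that similarity scores lie in [0, 1] are all illustrative choices made here. The idea shown is only the generic odds update, posterior odds = prior odds × P(score | match) / P(score | non-match).

```python
import numpy as np


def bayes_adjust(prior_prob, score, match_scores, nonmatch_scores, bins=10):
    """Update a prior match probability with a name-similarity score.

    Hypothetical helper: likelihoods are estimated from histograms of
    scores among known matches and known non-matches, with scores
    assumed to lie in [0, 1].
    """
    edges = np.linspace(0.0, 1.0, bins + 1)

    # Empirical score distributions, with add-one smoothing so that
    # no bin has zero likelihood.
    m_counts, _ = np.histogram(match_scores, bins=edges)
    u_counts, _ = np.histogram(nonmatch_scores, bins=edges)
    m_like = (m_counts + 1) / (m_counts.sum() + bins)
    u_like = (u_counts + 1) / (u_counts.sum() + bins)

    # Locate the bin containing the observed score (clamped to valid bins).
    idx = int(np.searchsorted(edges, score, side="right")) - 1
    idx = max(0, min(idx, bins - 1))

    # Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.
    prior_prob = float(np.clip(prior_prob, 1e-12, 1 - 1e-12))
    prior_odds = prior_prob / (1.0 - prior_prob)
    post_odds = prior_odds * (m_like[idx] / u_like[idx])
    return post_odds / (1.0 + post_odds)


# Toy labelled scores (made up for illustration only).
known_match_scores = [0.92, 0.95, 0.97, 0.88]
known_nonmatch_scores = [0.12, 0.15, 0.22, 0.35]

# A record pair whose names disagree exactly but score as highly similar.
print(bayes_adjust(0.5, 0.93, known_match_scores, known_nonmatch_scores))
```

In this sketch `prior_prob` stands in for the match probability produced by exact-agreement linkage on the supporting fields. A score typical of known matches raises that probability and a score typical of non-matches lowers it, so the adjustment depends on both the classifier score and the strength of the other evidence.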