Author Name Disambiguation via Heterogeneous Network Embedding from Structural and Semantic Perspectives

Paper title

Author Name Disambiguation via Heterogeneous Network Embedding from Structural and Semantic Perspectives

Authors

Wenjin Xie, Siyuan Liu, Xiaomeng Wang, Tao Jia

Abstract

Name ambiguity is common in academic digital libraries, where multiple authors may share the same name. This complicates academic data management and analysis and makes name disambiguation necessary. Name disambiguation partitions the publications published under the same name into groups, each belonging to a unique author. Publications carry a large amount of attribute information, which leaves traditional methods mired in feature selection: they select attributes manually and weight them equally, which usually hurts accuracy. The proposed method is based mainly on representation learning for heterogeneous networks and on clustering, and it uses self-attention to address this problem. Each publication's representation combines a structural representation and a semantic representation. The structural representation is obtained by meta-path-based sampling and a skip-gram-based embedding method, with meta-path-level attention introduced to learn the weight of each feature automatically. The semantic representation is generated using NLP tools. The proposed method achieves higher name disambiguation accuracy than the baselines, and ablation experiments demonstrate the improvement contributed by feature selection and by the meta-path-level attention. The experimental results show the advantage of the method in capturing as much attribute information as possible from publications while reducing the impact of redundant information.
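To make the meta-path-based sampling step more concrete, the following is a minimal sketch of a meta-path-guided random walk on a toy heterogeneous network. It is not the authors' implementation: the node ids, the type codes `P` (paper), `A` (author) and `V` (venue), the example edges, and the function names `build_adjacency` and `meta_path_walk` are all assumptions made for illustration.

```python
import random

# Toy heterogeneous network (assumed for illustration only).
# Type codes: "P" = paper, "A" = author, "V" = venue.
NODE_TYPE = {"p1": "P", "p2": "P", "p3": "P", "a1": "A", "a2": "A", "v1": "V"}
EDGES = [
    ("p1", "a1"), ("p2", "a1"), ("p2", "a2"), ("p3", "a2"),  # paper-author links
    ("p1", "v1"), ("p3", "v1"),                              # paper-venue links
]


def build_adjacency(edges):
    """Build an undirected adjacency list from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def meta_path_walk(adj, node_type, start, meta_path, walk_length, rng):
    """Random walk whose i-th node must have type meta_path[i % len(meta_path)].

    meta_path=["P", "A"] therefore yields walks of the form P-A-P-A-...
    The walk stops early if no neighbor of the required type exists.
    """
    if node_type[start] != meta_path[0]:
        raise ValueError("start node type must match the first meta-path type")
    walk = [start]
    while len(walk) < walk_length:
        wanted = meta_path[len(walk) % len(meta_path)]
        candidates = [n for n in adj.get(walk[-1], []) if node_type[n] == wanted]
        if not candidates:
            break
        walk.append(rng.choice(candidates))
    return walk


adj = build_adjacency(EDGES)
rng = random.Random(0)  # fixed seed so repeated runs give the same walks
pap_walk = meta_path_walk(adj, NODE_TYPE, "p1", ["P", "A"], 5, rng)
pvp_walk = meta_path_walk(adj, NODE_TYPE, "p1", ["P", "V"], 5, rng)
print(pap_walk)
print(pvp_walk)
```

In a pipeline of the kind the abstract describes, walks like these would serve as the "sentences" for a skip-gram model, with one set of walks per meta-path, so that a meta-path-level attention mechanism can weight the resulting embeddings. How the paper samples walks and combines them is not specified in the abstract, so this sketch only illustrates the general technique.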
