Paper Title

Structure and Semantics Preserving Document Representations

Paper Authors

Natraj Raman, Sameena Shah, Manuela Veloso

Paper Abstract

Retrieving relevant documents from a corpus is typically based on the semantic similarity between the document content and query text. The inclusion of structural relationship between documents can benefit the retrieval mechanism by addressing semantic gaps. However, incorporating these relationships requires tractable mechanisms that balance structure with semantics and take advantage of the prevalent pre-train/fine-tune paradigm. We propose here a holistic approach to learning document representations by integrating intra-document content with inter-document relations. Our deep metric learning solution analyzes the complex neighborhood structure in the relationship network to efficiently sample similar/dissimilar document pairs and defines a novel quintuplet loss function that simultaneously encourages document pairs that are semantically relevant to be closer and structurally unrelated to be far apart in the representation space. Furthermore, the separation margins between the documents are varied flexibly to encode the heterogeneity in relationship strengths. The model is fully fine-tunable and natively supports query projection during inference. We demonstrate that it outperforms competing methods on multiple datasets for document retrieval tasks.
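The abstract describes a quintuplet loss with margins that vary according to relationship strength. As a rough illustration only (the roles assigned to the five documents, the cosine-distance metric, and the margin values below are assumptions for the sketch, not the paper's actual formulation), a ranking-style loss over five document embeddings might look like this:

```python
import torch
import torch.nn.functional as F

def quintuplet_loss(anchor, strong_pos, weak_pos, soft_neg, hard_neg,
                    margins=(0.2, 0.4, 0.6)):
    """Hinge-style ranking loss over a quintuplet of document embeddings.

    Each tensor has shape (batch, dim). The margins grow as the assumed
    relationship strength decreases, so loosely related documents are
    pushed progressively farther from the anchor than strongly related ones.
    """
    # Cosine distance between two batches of embeddings.
    d = lambda a, b: 1.0 - F.cosine_similarity(a, b, dim=-1)

    # Rank the four companions by decreasing relationship strength,
    # each separated from the next by its own margin.
    l1 = F.relu(d(anchor, strong_pos) - d(anchor, weak_pos) + margins[0])
    l2 = F.relu(d(anchor, weak_pos) - d(anchor, soft_neg) + margins[1])
    l3 = F.relu(d(anchor, soft_neg) - d(anchor, hard_neg) + margins[2])
    return (l1 + l2 + l3).mean()

# Toy usage with random 128-d document embeddings (batch of 4).
docs = [torch.randn(4, 128) for _ in range(5)]
print(quintuplet_loss(*docs))
```

The per-pair margins stand in for the paper's flexible separation margins: stronger relations get tighter margins, weaker ones larger, encoding the heterogeneity in relationship strengths mentioned above.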
