Paper Title
Ultron: An Ultimate Retriever on Corpus with a Model-based Indexer
Paper Authors
Paper Abstract
Document retrieval has been studied extensively for decades within the index-retrieve framework, which has withstood the test of time. Unfortunately, such a pipelined framework limits the optimization of the final retrieval quality, because indexing and retrieving are separate stages that cannot be jointly optimized in an end-to-end manner. To unify these two stages, we explore a model-based indexer for document retrieval. Concretely, we propose Ultron, which encodes the knowledge of all documents into the model and aims to directly retrieve relevant documents end-to-end. For the model-based indexer, how to represent docids and how to train the model are the two main issues to be explored. Existing solutions suffer from semantically deficient docids and limited supervised data. To tackle these two problems, we first devise two types of docids that are richer in semantics and easier for model inference. In addition, we propose a three-stage training workflow to capture more knowledge contained in the corpus and the associations between queries and docids. Experiments on two public datasets demonstrate the superiority of Ultron over advanced baselines for document retrieval.