Paper Title
Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction
Paper Authors
Paper Abstract
Recent progress in neural information retrieval has demonstrated large gains in effectiveness, while often sacrificing the efficiency and interpretability of the neural model compared to classical approaches. This paper proposes ColBERTer, a neural retrieval model using contextualized late interaction (ColBERT) with enhanced reduction. Along the effectiveness Pareto frontier, ColBERTer's reductions dramatically lower ColBERT's storage requirements while simultaneously improving the interpretability of its token-matching scores. To this end, ColBERTer fuses single-vector retrieval, multi-vector refinement, and optional lexical matching components into one model. For its multi-vector component, ColBERTer reduces the number of stored vectors per document by learning unique whole-word representations for the terms in each document and by learning to identify and remove word representations that are not essential to effective scoring. We employ explicit multi-task, multi-stage training to facilitate using very small vector dimensions. Results on the MS MARCO and TREC-DL collections show that ColBERTer can reduce the storage footprint by up to 2.5x while maintaining effectiveness. With just one dimension per token in its smallest setting, ColBERTer achieves index storage parity with the plaintext size, with very strong effectiveness results. Finally, we demonstrate ColBERTer's robustness on seven high-quality out-of-domain collections, yielding statistically significant gains over traditional retrieval baselines.
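The abstract's multi-vector reduction can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the mean-pooling of subword vectors into whole-word vectors, the threshold-based gate for dropping non-essential words, and the MaxSim-style late-interaction score below are simplified stand-ins for the learned components described in the paper, and all function names and parameters (`pool_whole_words`, `prune_words`, `maxsim_score`, `keep_threshold`) are hypothetical.

```python
import numpy as np

def pool_whole_words(token_vecs, word_ids):
    """Aggregate subword token vectors into one vector per whole word.

    token_vecs: (num_tokens, dim) contextualized subword embeddings.
    word_ids:   list mapping each token to the index of its whole word.
    Mean pooling is an assumption; the paper learns the aggregation.
    """
    words = sorted(set(word_ids))
    return np.stack([
        token_vecs[[i for i, w in enumerate(word_ids) if w == wid]].mean(axis=0)
        for wid in words
    ])

def prune_words(word_vecs, gate_scores, keep_threshold=0.5):
    """Drop word vectors deemed non-essential for scoring.

    gate_scores stands in for a learned per-word importance score;
    thresholding is a simplification of the paper's learned removal.
    """
    mask = gate_scores > keep_threshold
    return word_vecs[mask], mask

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query vector, take the
    maximum similarity over the document's stored vectors, then sum."""
    sims = query_vecs @ doc_vecs.T          # (num_query, num_doc_words)
    return sims.max(axis=1).sum()
```

In this sketch, storage savings come from two places: pooling shrinks the vector count from the number of subword tokens to the number of whole words, and pruning removes words whose gate score falls below the threshold, so only the surviving whole-word vectors need to be indexed.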