Paper Title
Precise Zero-Shot Dense Retrieval without Relevance Labels
Paper Authors
Paper Abstract
While dense retrieval has been shown effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings~(HyDE). Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder~(e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step grounds the generated document in the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages~(e.g. sw, ko, ja).
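The abstract describes a two-step pipeline: generate a hypothetical document with an instruction-following LM, then encode it with an unsupervised encoder and retrieve real documents by vector similarity. Below is a minimal sketch of that flow. The prompt wording, the `generate_hypothetical_document` placeholder, and the in-memory corpus are illustrative assumptions, not the paper's exact implementation; only the Contriever encoder usage (mean pooling over token embeddings via Hugging Face `transformers`) reflects how that public checkpoint is typically used.

```python
# Minimal HyDE-style retrieval sketch (assumptions noted in comments).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def mean_pool(last_hidden_state, attention_mask):
    # Contriever-style mean pooling over non-padding token embeddings.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

def encode(texts):
    # Encode a list of texts into dense vectors with the unsupervised encoder.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return mean_pool(outputs.last_hidden_state, inputs["attention_mask"])

def generate_hypothetical_document(query: str) -> str:
    # Placeholder for the instruction-following LM call (e.g. InstructGPT).
    # Assumed prompt shape: "Write a passage that answers the question: {query}"
    raise NotImplementedError("plug in your LLM client here")

def hyde_search(query, corpus_texts, corpus_embeddings, k=5):
    # 1. Zero-shot generate a hypothetical (possibly inaccurate) document.
    hypothetical_doc = generate_hypothetical_document(query)
    # 2. Encode it; the dense bottleneck keeps relevance patterns while
    #    discarding fabricated surface details.
    query_vector = encode([hypothetical_doc])            # shape (1, d)
    # 3. Retrieve real documents by inner-product similarity.
    scores = (corpus_embeddings @ query_vector.T).squeeze(1)  # shape (N,)
    top = torch.topk(scores, k=min(k, len(corpus_texts))).indices
    return [(corpus_texts[i], scores[i].item()) for i in top.tolist()]
```

In this sketch `corpus_embeddings` would be precomputed by running the same `encode` function over the document collection, so the generated document and the real documents live in the same embedding space.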