[RAG Paper] GenRead: "generate-read" may be more effective than "retrieve-read"

server/2024/9/24 10:20:28/

Paper: Generate rather than Retrieve: Large Language Models are Strong Context Generators
⭐⭐⭐⭐
ICLR 2023
Code: github.com/wyu97/GenRead

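The paper finds that documents generated by an LLM are often more likely to contain the correct answer than retrieved documents. It therefore explores an alternative to the retrieve-then-read pipeline: the generate-then-read pipeline (GenRead).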

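The basic idea of generate-then-read has two steps: first prompt an LLM to generate contextual documents for the given question (Generate), then read the generated documents to produce the final answer (Read).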

Note that the Generate and Read steps can use different models: the Read step can use a smaller model trained on the target dataset (such as FiD).
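As a rough illustration of the two-step pipeline, here is a minimal sketch. The names `generate_then_read`, `toy_generate`, `toy_read`, and the wording of `PROMPT` are assumptions made for this example and are not taken from the GenRead code; in practice `generate` would call an LLM and `read` would be a reader model such as FiD.

```python
from typing import Callable, List

# Hypothetical prompt wording, used only for illustration.
PROMPT = "Generate a background document to answer the given question.\n\n{question}"


def generate_then_read(
    question: str,
    generate: Callable[[str], str],
    read: Callable[[str, List[str]], str],
) -> str:
    """Generate a contextual document, then let a (possibly different) reader answer."""
    document = generate(PROMPT.format(question=question))  # Generate step
    return read(question, [document])                      # Read step


# Toy stand-ins so the sketch runs without any model; both are hypothetical.
def toy_generate(prompt: str) -> str:
    return "Paris is the capital of France."


def toy_read(question: str, documents: List[str]) -> str:
    return "Paris" if any("Paris" in d for d in documents) else "unknown"


answer = generate_then_read("What is the capital of France?", toy_generate, toy_read)
print(answer)  # Paris
```

Passing the generator and the reader as separate callables mirrors the point above: the two steps do not have to share a model.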

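The paper points out that getting an LLM to generate diverse, high-quality contextual documents is challenging. Some of the paper's techniques are described here.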

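Because a single prompt tends to produce similar token distributions, and therefore similar documents, human annotators are asked to write several different prompts, which steer the LLM toward diverse contextual documents.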

The paper notes that this approach is simple but effective.
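The human-prompt idea can be sketched as one generation call per prompt template. The three templates in `HUMAN_PROMPTS` and the helper `generate_with_human_prompts` are hypothetical examples, not the prompts written by the paper's annotators.

```python
from typing import Callable, List

# Hypothetical human-written templates; the paper's actual prompts may differ.
HUMAN_PROMPTS = [
    "Generate a background document to answer the given question.\n\n{question}",
    "Write a short encyclopedia-style passage relevant to this question.\n\n{question}",
    "List the key facts needed to answer the following question.\n\n{question}",
]


def generate_with_human_prompts(
    question: str,
    generate: Callable[[str], str],
    prompts: List[str] = HUMAN_PROMPTS,
) -> List[str]:
    """Call the generator once per template and keep the distinct documents."""
    documents: List[str] = []
    for template in prompts:
        doc = generate(template.format(question=question))
        if doc not in documents:  # drop verbatim duplicates only
            documents.append(doc)
    return documents
```

With a real LLM behind `generate`, each template would nudge the model toward a different style of document; the reader then receives all of them.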

In the Generate step, in-context learning can also be used to steer the LLM toward diverse documents: the few-shot exemplars given to the LLM should contain question-document examples that are as diverse as possible.
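To make the role of the exemplars concrete, the sketch below assembles a few-shot prompt from question-document pairs. The `Question:` / `Document:` layout and the helper `build_few_shot_prompt` are assumptions for illustration; the exact prompt format used in the paper may differ.

```python
from typing import List, Tuple


def build_few_shot_prompt(exemplars: List[Tuple[str, str]], question: str) -> str:
    """Concatenate question-document exemplars, then leave the last document open."""
    blocks = [f"Question: {q}\nDocument: {d}" for q, d in exemplars]
    blocks.append(f"Question: {question}\nDocument:")  # the LLM continues from here
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    [
        ("Who wrote Hamlet?", "Hamlet is a tragedy by William Shakespeare."),
        ("What is the capital of France?", "Paris is the capital of France."),
    ],
    "Who painted the Mona Lisa?",
)
```

Swapping in a different set of exemplars changes what the LLM imitates, which is why the choice of exemplars matters for diversity.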

The exemplars are selected as follows:

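For each question in the dataset, use a retriever (such as BM25) to retrieve one corresponding document. This yields a set of question-document pairs.
Use an LLM (such as GPT-3) to encode the document in each pair into a 12288-dimensional vector, then run K-means clustering on these vectors.
From each cluster, randomly sample n question-document pairs to use as exemplars.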

This procedure yields the in-context demonstrations used to guide the LLM toward diverse contextual documents.
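The clustering and sampling steps above can be sketched as follows. This is only an illustration under stated assumptions: the embeddings are tiny 2-dimensional toy vectors standing in for the 12288-dimensional LLM encodings, the K-means is a small hand-written version (a library implementation would normally be used), and the names `kmeans` and `sample_exemplars_per_cluster` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (question, document)


def _dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[Sequence[float]], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Tiny K-means for illustration; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda c: _dist2(v, centroids[c])) for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sample_exemplars_per_cluster(
    pairs: List[Pair], vectors: List[Sequence[float]], k: int, n: int, seed: int = 0
) -> Dict[int, List[Pair]]:
    """Cluster the document vectors, then randomly pick up to n pairs from each cluster."""
    labels = kmeans(vectors, k, seed=seed)
    clusters: Dict[int, List[Pair]] = {}
    for pair, label in zip(pairs, labels):
        clusters.setdefault(label, []).append(pair)
    rng = random.Random(seed)
    return {c: rng.sample(m, min(n, len(m))) for c, m in sorted(clusters.items())}


# Toy data: six pairs whose (made-up) embeddings form two well-separated groups.
toy_pairs = [(f"q{i}", f"d{i}") for i in range(6)]
toy_vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
exemplars = sample_exemplars_per_cluster(toy_pairs, toy_vectors, k=2, n=2)
```

Each cluster's sample then serves as one set of in-context demonstrations, so the LLM is prompted once per cluster and sees different exemplars each time.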

The figure below shows the overall architecture of the clustering-based prompting method:

[Figure: overall architecture of clustering-based prompting]