RAG for LLMs: Translation and Commentary on "Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely"
Overview: This paper surveys how to use external data more effectively to augment large language models (LLMs) and proposes a systematic methodology.
>> Background and Pain Points:
● Limitations of LLMs: Although LLMs exhibit strong capabilities, they still face many challenges in domain-specific applications, such as model hallucination, lack of domain-specific knowledge, and poor timeliness.
● Challenges of data augmentation: Integrating external data (especially private data) into LLMs can improve performance but raises many challenges, including data preprocessing and indexing, retrieving relevant data, understanding user intent, fully exploiting the LLM's reasoning capabilities, and handling different task types differently. Existing work tends to address only one of these aspects and lacks a systematic view.
● No unified solution: Data-augmented LLM applications are not one-size-fits-all. In practice, underperformance often stems from failing to identify the core focus of a task, or from tasks that inherently require a blend of capabilities that must be disentangled to be solved well.
>> Proposed Solution: The paper's core idea is to categorize tasks by query type and propose targeted solutions for each type.
● Query-type taxonomy: The paper classifies queries into four levels:
**** Level 1: Explicit Fact Queries: facts are extracted directly from the data, with no reasoning required. Solution: RAG.
**** Level 2: Implicit Fact Queries: multiple facts must be combined through reasoning; the answer is not stated directly. Solutions: iterative RAG, graph/tree-based RAG, natural-language-to-SQL.
**** Level 3: Interpretable Rationale Queries: require understanding and applying domain-specific reasoning rules explicitly provided in the data. Solutions: prompt tuning, chain-of-thought (CoT) prompting.
**** Level 4: Hidden Rationale Queries: require inferring reasoning rules and logical relationships that are not explicitly recorded in the data. Solutions: offline learning, in-context learning (ICL), fine-tuning.
● RAG enhancements: For Level 1 and Level 2 queries, the paper discusses improvements to RAG in detail, covering data preprocessing (multimodal data handling, chunking optimization), retrieval (sparse, dense, and hybrid retrieval, plus several query-document alignment methods), and response generation (handling conflicts in retrieved information, jointly training the retriever and generator).
● Prompt engineering and chain of thought: For Level 3 queries, the paper introduces prompt tuning and CoT prompting, and how to optimize prompts to better guide the LLM to follow external rules.
● Offline learning and in-context learning: For Level 4 queries, the paper explores offline learning (extracting rules and guidelines from the data), in-context learning (learning from examples), and fine-tuning. These methods focus on learning implicit reasoning rules and problem-solving strategies from the data.
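The four-level taxonomy lends itself to a query router. The sketch below is a hypothetical keyword heuristic, not part of the paper: the cue lists and `classify_query` are purely illustrative, and a real system would use an LLM-based classifier instead.

```python
# Hypothetical heuristic mapping a query to one of the survey's four levels.
# Cue lists are illustrative stand-ins for an LLM-based router.

LEVELS = {1: "explicit fact", 2: "implicit fact",
          3: "interpretable rationale", 4: "hidden rationale"}

def classify_query(query: str) -> int:
    """Return a rough query level (1-4) from surface cues in the question."""
    q = query.lower()
    if any(cue in q for cue in ("strategy", "diagnose", "best approach")):
        return 4  # rationale must be inferred from data (hidden)
    if any(cue in q for cue in ("why", "how should", "according to the rules")):
        return 3  # rationale is written down somewhere (interpretable)
    if any(cue in q for cue in ("compare", "across", "how many", "trend")):
        return 2  # several facts must be combined (implicit)
    return 1      # single fact lookup (explicit) -> basic RAG
```

In practice the router's output would select among the techniques listed above: basic RAG for level 1, iterative or graph-based RAG for level 2, and so on.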
>> Core Workflow:
● Task analysis: first analyze the task and determine which query level it belongs to.
● Method selection: choose the appropriate technique for that level.
● Data processing: preprocess the external data, e.g., cleaning, chunking, indexing.
● Retrieval: fetch the relevant data with a suitable retrieval method.
● Response generation: use the LLM to generate the answer, with reasoning and explanation as needed.
● Evaluation: measure system performance and iterate.
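The workflow above can be sketched end to end. The snippet below is a minimal illustration rather than the paper's implementation: bag-of-words cosine similarity stands in for a real sparse/dense retriever, and the assembled prompt would be passed to an actual LLM.

```python
# Minimal sketch of the workflow: chunk (data processing), retrieve, and
# assemble a prompt (response generation). Bag-of-words cosine similarity is a
# stand-in for a real retriever; the prompt would go to an actual LLM.
import math
from collections import Counter

def chunk(text: str, size: int = 50) -> list[str]:
    """Split a document into fixed-size word chunks (preprocessing step)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the query (retrieval step)."""
    qv = vectorize(query)
    return sorted(chunks, key=lambda c: cosine(qv, vectorize(c)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Assemble retrieved evidence into the LLM prompt (generation step)."""
    return "Context:\n" + "\n".join(context) + f"\nQuestion: {query}\nAnswer:"
```

Evaluation (the final step) would then score the LLM's answers against a labeled set and feed failures back into chunking and retrieval choices.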
>> Strengths:
● Systematic: the paper offers a systematic framework for understanding and addressing the challenges of data-augmented LLM applications.
● Comprehensive: it covers a wide range of techniques and analyzes their trade-offs in detail.
● Practical: it provides concrete examples and case studies that help developers understand and apply these techniques.
>> Conclusions and Takeaways:
● Data-augmented LLM applications are not one-size-fits-all; the technique must be matched to the task type.
● The proposed four-level query taxonomy helps developers understand task complexity and choose the appropriate techniques.
● There are three main ways to inject domain knowledge into an LLM: context injection, training a small model, and directly fine-tuning the large model. Each has its own trade-offs and should be chosen per scenario.
● In practice, multiple methods often need to be combined.
In short, this paper offers a comprehensive guide to building effective data-augmented LLM applications, emphasizes the importance of task analysis and method selection, and examines the relevant techniques in depth. It both summarizes existing techniques and points to directions for future research.
Contents
Figure 1: Main Focus of Four Level Queries
Figure 3: Three Types of Query-Document Alignment
Figure 4: Demonstration of Rationale Queries
Figure 5: Summary of Main Techniques for Different Query Levels in Data augmented LLM applications
Translation and Commentary on "Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely"
Link | https://arxiv.org/abs/2409.14924 |
Date | September 23, 2024 |
Authors | Microsoft Research Asia |
Abstract
Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges. These challenges encompass a wide range of issues, from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method, classifying user queries into four levels based on the type of external data required and primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and most effective techniques for addressing these challenges. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications. 
1. Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities, including extensive world knowledge and sophisticated reasoning skills. Despite these advancements, there are significant challenges in effectively deploying them across various specialized fields. These challenges include issues like model hallucinations, misalignment with domain-specific knowledge, among others. Incorporating domain-specific data, particularly private or on-premise data that could not be included in their initial training corpus, is crucial for tailoring LLM applications to meet specific industry needs. Through techniques like RAG and fine-tuning, data-augmented LLM applications have demonstrated advantages over applications built solely on generic LLMs, in several aspects:
>> Enhanced Professionalism and Timeliness: The data used to train LLMs often lags in timeliness and may not cover all domains comprehensively, especially proprietary data owned by users. Data-augmented LLM applications address this issue by providing more detailed and accurate answers for complex questions, allowing for data updates and customization.
>> Alignment with Domain Experts: Through the use of and learning from domain-specific data, data-augmented LLM applications can exhibit capabilities more like domain experts, such as doctors and lawyers.
>> Reduction in Model Hallucination: Data-augmented LLM applications generate responses based on real data, grounding their reactions in facts and significantly minimizing the possibility of hallucinations.
>> Improved Controllability and Explainability: The data used can serve as a reference for the model's predictions, enhancing both controllability and explainability.
Despite the enthusiasm for these advancements, developers often struggle and have to invest a significant amount of human labor to meet expectations (e.g., achieving a high success rate in question answering). Numerous studies [1, 2, 3, 4, 5] highlight the challenges and frustrations involved in constructing data-augmented LLM applications based on technologies like RAG and fine-tuning, particularly in specialized domains such as the legal field, healthcare, manufacturing, and others. These challenges span a wide range, from constructing data pipelines (e.g., data processing and indexing) to leveraging LLMs' capabilities to achieve complex intelligent reasoning. For example, in finance applications there is a frequent need to understand and utilize high-dimensional time series data, whereas in healthcare, medical images or time-series medical records are often essential. Enabling LLMs to comprehend these varied forms of data represents a recurring challenge. On the other hand, in legal and mathematical applications, LLMs typically struggle to grasp long-distance dependencies between different structures. Additionally, depending on the specific application domain, there are increased demands for the interpretability and consistency of LLM responses. The inherent nature of LLMs tends to be characterized by low interpretability and high uncertainty, which poses significant challenges. Enhancing the transparency of LLMs and reducing their uncertainty are critical for increasing trust and reliability in their outputs, especially in fields where precision and accountability are paramount.
Through extensive discussions with domain experts and developers, and by carefully analyzing the challenges they face, we have gained a deep understanding that data-augmented LLM applications are not a one-size-fits-all solution. The real-world demands, particularly in expert domains, are highly complex and can vary significantly in their relationship with given data and the reasoning difficulties they require. However, developers often do not realize these distinctions and end up with a solution full of performance pitfalls (akin to a house with leaks everywhere). In contrast, if we could fully understand the demands at different levels and their unique challenges, we could build applications accordingly and make the application steadily improve (like constructing a solid and reliable house step by step). Yet, research efforts and existing relevant surveys [6, 7, 8, 9, 10, 11, 12, 13] frequently focus on only one of these levels or a particular topic of technologies. This has motivated us to compile this comprehensive survey, which aims to clearly define these different levels of queries, identify the unique challenges associated with each (Figure 1), and list related works and efforts addressing them. This survey is intended to help readers construct a bird's-eye view of data-augmented LLM applications and also serve as a handbook on how to approach the development of such applications systematically.
Figure 1: Main Focus of Four Level Queries
Figure 3: Three Types of Query-Document Alignment
Figure 4: Demonstration of Rationale Queries
Figure 5: Summary of Main Techniques for Different Query Levels in Data augmented LLM applications
Figure 6: Three ways to inject specific domain data into an LLM: a) extracting part of the domain data based on the query as context input for the LLM, b) training a smaller model with specific domain data, which then guides the integration of external information subsequently input into the LLM, and c) directly using external domain knowledge to fine-tune a general large language model to become a domain-expert model.
Conclusion
In this paper, we delineate data-augmented LLM applications into four distinct categories based on the primary focus of queries, each facing unique challenges and thus requiring tailored solutions, as illustrated in Figure 5. For queries related to static common knowledge, deploying a general LLM through a Chain of Thought methodology is effective. For explicit fact queries, the main challenge involves pinpointing the exact location of facts within a database, thus making a basic RAG the method of choice. In the case of implicit fact queries, which require the collation of multiple related facts, iterative RAG and RAG implementations on graph or tree structures are preferred for their ability to concurrently retrieve individual facts and interconnect multiple data points. When extensive data linkage is necessary, text-to-SQL techniques prove indispensable, leveraging database tools to facilitate external data searches. For interpretable rationale queries, advancements through prompt tuning and CoT prompting are critical to enhance LLMs' compliance with external directives. The most formidable are hidden rationale queries, which demand the autonomous synthesis of problem-solving approaches from extensive data sets. Here, offline learning, in-context learning, and fine-tuning emerge as vital methodologies.
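The text-to-SQL route mentioned above can be made concrete with a toy example. Everything here is hypothetical, including the schema, the data, and the generated SQL: an LLM would translate the natural-language question into SQL, which is then executed against the database.

```python
# Toy illustration of the text-to-SQL route: an LLM would translate
# "How many orders did Alice place?" into SQL; here the SQL is hard-coded
# and run against an in-memory SQLite table with a hypothetical schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob"), (3, "Alice")])

# SQL an LLM might emit for the natural-language question above:
generated_sql = "SELECT COUNT(*) FROM orders WHERE customer = 'Alice'"
count = conn.execute(generated_sql).fetchone()[0]
print(count)  # 2
```

The database engine, not the LLM, performs the aggregation, which is why this route scales to extensive data linkage that would not fit in a context window.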
Prior to the development of a targeted LLM application, as domain experts, we must acquire an in-depth understanding of the intended task, ascertain the complexity level of the associated queries, and select corresponding technological approaches for resolution. These methods principally inject knowledge into LLMs via three mechanisms, as depicted in Figure 6: a) extracting part of the domain data based on the query as context input for the LLM, b) training a smaller model with specific domain data, which then guides the integration of external information subsequently input into the LLM, and c) directly using external domain knowledge to fine-tune a general large language model to become a domain-expert model. These strategies differ in their requirements for data volume, training duration, and computational resources, escalating respectively. Knowledge injection through context provides better interpretability and stability but faces limitations due to the finite context window and potential information loss in the middle [40], ideally suited for scenarios where data can be succinctly explained in shorter texts. However, this method challenges the model's retrieval capabilities and knowledge extraction ability. The small model approach offers the advantage of reduced training times and the capacity to assimilate considerable amounts of data, yet its efficacy is contingent upon the model's capabilities, potentially capping the LLM's performance for more complex tasks and incurring additional training costs as the data grows. Fine-tuning facilitates the utilization of large model capacities with extensive domain-specific data, yet its impact on the LLM strongly depends on the design of the data used. Employing out-of-domain factual data for fine-tuning may inadvertently lead to the generation of more erroneous outputs by the LLM, while also risking the loss of previously known domain knowledge and the neglect of tasks not encountered during fine-tuning [110, 195].
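The finite-context-window limitation of mechanism (a) can be illustrated with a budgeted packing sketch. The token budget and whitespace "tokenization" below are illustrative assumptions, not from the paper; a real system would count model tokens and may also reorder chunks to mitigate loss in the middle.

```python
# Sketch of mechanism (a): inject retrieved domain chunks as context, dropping
# the lowest-ranked chunks once a (hypothetical) context-window budget is hit.
# Tokens are approximated by whitespace-separated words for illustration.

def inject_context(query: str, ranked_chunks: list[str], budget: int = 100) -> str:
    """Pack highest-ranked chunks first; stop before exceeding the budget."""
    kept, used = [], len(query.split())
    for c in ranked_chunks:
        cost = len(c.split())
        if used + cost > budget:
            break  # finite context window: lower-ranked chunks are dropped
        kept.append(c)
        used += cost
    return "\n".join(kept) + "\nQ: " + query
```

The drop-on-overflow behavior is exactly the trade-off described above: context injection is interpretable and stable, but whatever does not fit in the window is simply unavailable to the model.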
Therefore, choosing an appropriate data injection strategy for the LLM requires a thorough understanding of one's data sources and judicious decision-making based on this insight.
Moreover, in practical scenarios, data-augmented LLM applications typically involve a combination of diverse query types, necessitating developers to engineer a routing pipeline that integrates multiple methodologies to effectively tackle these multifaceted challenges.
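Such a routing pipeline might look like the following sketch, where each query level from the survey maps to a dedicated handler. The handler names and returned strings are illustrative placeholders for the actual sub-pipelines.

```python
# Sketch of a routing pipeline: each query level from the survey maps to a
# handler. In a real system the level would come from an LLM-based classifier
# and each handler would be a full sub-pipeline rather than a string.

def handle_explicit(q: str) -> str: return f"basic RAG: {q}"
def handle_implicit(q: str) -> str: return f"iterative/graph RAG or text-to-SQL: {q}"
def handle_interpretable(q: str) -> str: return f"prompt tuning + CoT: {q}"
def handle_hidden(q: str) -> str: return f"offline learning / ICL / fine-tuning: {q}"

ROUTES = {1: handle_explicit, 2: handle_implicit,
          3: handle_interpretable, 4: handle_hidden}

def route(query: str, level: int) -> str:
    """Dispatch the query to the technique suited to its level."""
    return ROUTES[level](query)
```

A production router would also handle queries that span levels, for example by decomposing them into sub-queries and dispatching each separately.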