RAG Knowledge Bases for Large Models in Practice (Advanced Retrieval)

常耀斌

Last modified 2025-05-24 19:18:14

License: CC 4.0 BY-SA

Article tags: artificial intelligence, deep learning, machine learning

First published 2025-05-17 10:28:03

Article link: https://blog.csdn.net/Peter_Changyb/article/details/148024615

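When we apply large models to real business scenarios, we find that a general-purpose foundation model rarely meets actual business needs on its own, mainly for the following reasons:

Knowledge limitations: A model's knowledge comes entirely from its training data. The training sets of today's mainstream large models (DeepSeek, ERNIE Bot (文心一言), Tongyi Qianwen (通义千问), ...) are built mostly from publicly available web data, so real-time, non-public, or offline data is out of their reach.
Hallucination: AI models rest on mathematical probability, and their output is ultimately the result of a series of numerical operations. Large models are no exception, so they often state falsehoods with complete confidence, especially in areas where the model lacks knowledge or is weak.
Data security: For an enterprise, data security is critical. No company wants to risk a data leak by uploading its private data to a third-party platform for training. As a result, solutions that rely entirely on the capabilities of a general-purpose large model have to trade off data security against quality.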

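What are the four core stages of the RAG workflow?

Indexing: Clean the raw data, split it into chunks, encode the chunks as vectors, and store them in a vector database (such as Faiss or Milvus) to form a searchable knowledge base.
Retrieval: Hybrid retrieval (vector plus keyword), metadata enhancement, and sub-query optimization.
Reranking (post-retrieval): Screen, filter, and reorder the retrieved passages together with their accompanying context.
Generation: Combine the retrieved results with the LLM to generate an answer, citing sources to improve credibility.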

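How to chunk better?

Transformer models have a fixed maximum input sequence length, and even when the context window is large, the vector of one or a few sentences represents their meaning better than a vector averaged over several pages of text. The data therefore needs to be chunked. Chunking splits the source document into pieces of a certain size while trying not to lose semantic meaning: split the text into sentences or paragraphs rather than cutting a single sentence in two. Various text splitter implementations can do this.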
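
As a minimal sketch of sentence-aware chunking, the function below packs whole sentences into chunks up to a size limit. The function name, the regex-based sentence splitter, and the use of characters rather than tokens as the size unit are all assumptions made for illustration; a production splitter would typically count tokens with the embedding model's own tokenizer.

```python
import re

def chunk_by_sentences(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation (Latin or CJK); a rough heuristic only.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?。！？])\s*", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it would overflow
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "RAG retrieves context. It then generates an answer. Chunking affects both steps."
chunks = chunk_by_sentences(text, max_chars=55)
for chunk in chunks:
    print(chunk)
```

Note that a single sentence longer than max_chars is kept whole as its own oversized chunk, since this sketch never cuts inside a sentence.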

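Chunk size is an important parameter. It depends on the embedding model you use and its token capacity: standard Transformer encoder models (such as BERT-based Sentence Transformers) accept at most 512 tokens. OpenAI ada-002 can handle longer sequences, up to 8191 tokens, but there is a trade-off between leaving enough context for the large language model to reason over and keeping the text representation specific enough for retrieval to work well.

In semantic search, for example, we index a corpus of documents, each containing valuable information about a particular topic. An effective chunking strategy helps the search results capture what the user's query is really about. Chunks that are too small or too large can make results imprecise or cause relevant content to be missed. As a rule of thumb, if a chunk of text makes sense to a human without its surrounding context, it will make sense to a language model too. Finding the best chunk size for the documents in the corpus is therefore essential to accurate and relevant search results.

If common chunking approaches (such as fixed-size chunking) do not fit your use case easily, the following tips can help you find a suitable chunk size.

Preprocess the data: Before settling on a chunk size for your application, preprocess the data to ensure its quality. For example, if the data was retrieved from the web, you may need to strip HTML tags or specific elements that only add noise.

Choose a range of chunk sizes: Once the data is preprocessed, choose a range of candidate chunk sizes to test. As noted above, the choice should take into account the nature of the content (for example, short messages or lengthy documents) and the embedding model you will use and its capabilities (for example, its token limit). The goal is to balance preserving context against maintaining accuracy. Start by exploring a variety of sizes, including smaller chunks (for example, 128 or 256 tokens) that capture finer-grained semantic information and larger chunks (for example, 512 or 1024 tokens) that preserve more context.

Evaluate each chunk size: To test the candidate sizes, you can use multiple indexes or a single index with multiple namespaces. Using a representative dataset, create embeddings for each chunk size under test and save them in the index. Then run a series of queries whose result quality you can assess, and compare how the different chunk sizes perform. This is most likely an iterative process in which you test different chunk sizes against different queries until you can identify the size that performs best for your content and expected queries.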

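What does embedding do?

Embedding is a technique that converts a text sequence (such as a word, sentence, or document) into a vector representation of fixed dimension.

Model objective: Text sequences with similar meaning should map to vectors that are as close as possible (high similarity), while sequences with different meaning should map to vectors that are as far apart as possible (low similarity).

Purpose: By computing distances between vectors, the most similar text sequences can be retrieved quickly.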
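
The distance computation can be illustrated with cosine similarity, one common choice of measure. This is a small sketch: the three-dimensional vectors and the document names are made-up placeholders, whereas real embedding models produce hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, for illustration only
corpus = {
    "python tutorial": [0.9, 0.1, 0.0],
    "python tips": [0.8, 0.3, 0.1],
    "cooking recipes": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

# Rank the corpus entries from most to least similar to the query
ranked = sorted(corpus, key=lambda name: cosine_similarity(query, corpus[name]), reverse=True)
print(ranked)
```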

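How to index?

Chunking strategy: Split documents into chunks of a fixed number of tokens (for example, 100, 256, or 512 tokens).
Metadata attachments: Attach metadata to each text chunk, such as page number, file name, author, category, and timestamp, as well as paragraph summaries and questions the chunk could answer, so that retrieval can be filtered on this metadata and its scope narrowed.
Structural index: Build a hierarchical or graph structure over the document, for example parent-child relationships between chunks after splitting, or a document graph whose nodes are text units (paragraphs, tables, pages, and so on) and whose edges are semantic or textual similarity.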
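
The metadata and parent-child ideas above can be sketched with a small in-memory structure. The Chunk class, its field names, and the sample records are hypothetical; a real system would store the same information in the vector database's payload or metadata fields.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)
    parent_id: Optional[str] = None  # link to the enclosing section, if any

# Hypothetical sample chunks
chunks = [
    Chunk("sec-1", "Installation guide (full section text)", {"file": "manual.pdf", "page": 3}),
    Chunk("sec-1-a", "Run the installer as administrator.", {"file": "manual.pdf", "page": 3}, parent_id="sec-1"),
    Chunk("faq-7", "Reset your password from the login page.", {"file": "faq.md", "page": 1}),
]
by_id = {c.chunk_id: c for c in chunks}

def filter_by_metadata(candidates, **conditions):
    """Keep only chunks whose metadata matches every key=value condition."""
    return [c for c in candidates if all(c.metadata.get(k) == v for k, v in conditions.items())]

def expand_to_parent(chunk):
    """Return the parent chunk for extra context, or the chunk itself if it has none."""
    return by_id.get(chunk.parent_id, chunk)

manual_chunks = filter_by_metadata(chunks, file="manual.pdf")
context = expand_to_parent(by_id["sec-1-a"])
print([c.chunk_id for c in manual_chunks], context.chunk_id)
```

Retrieving on the small child chunk and then handing the parent to the LLM is one way to keep retrieval precise while still giving the model enough context.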

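How to optimize the retrieval stage?

The retrieval stage is the foundation of RAG and directly affects the quality of the generated result. Optimization directions include:

1. Strengthening the retriever

Hybrid retrieval: Combine keyword retrieval (such as BM25) with dense vector retrieval (such as DPR or Sentence-BERT) to exploit their complementary strengths. For example, BM25 is good at exact keyword matching, while vector retrieval is good at semantic similarity.
Re-ranking: Re-order the retrieved results in a second pass, using a finer-grained model (such as a Cross-Encoder or ColBERT) to compute relevance scores and improve the precision of the top-K results.
Multimodal retrieval: Extend retrieval to multimodal data such as images and tables, for example by using the CLIP model to align the embedding spaces of text and images.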

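2. Data preprocessing

Chunking strategy: Adjust how text is chunked according to the type of data:
- Fixed-length chunking: suitable for general scenarios, but may cut through semantic units.
- Dynamic chunking: split on semantic boundaries (such as paragraphs or headings), or use an NLP library (such as spaCy) to detect sentence boundaries.
- Overlapping chunking: add an overlap between adjacent chunks (for example, 10% of the chunk length) to avoid truncating information.
Metadata enhancement: Attach metadata (such as source, time, and entities) to each chunk, and use it at retrieval time to filter out irrelevant content.
Data cleaning: Remove noise (such as HTML tags and duplicated text), normalize (lowercasing, spelling correction), and recognize entities (linking them to a knowledge base).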
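
Overlapping chunking can be sketched as a sliding window over a token list. The function name and the whitespace tokenization are assumptions for illustration; in practice the tokens would come from the embedding model's tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size=8, overlap=2):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

tokens = "a b c d e f g h i j k l".split()
windows = chunk_with_overlap(tokens, chunk_size=5, overlap=1)
for window in windows:
    print(window)
```

Each window starts with the last token of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.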

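3. Embedding optimization

Domain-adaptive fine-tuning: Fine-tune a pretrained embedding model (such as BERT or RoBERTa) on domain data to improve its semantic representations.
Multi-vector encoding: Generate several vectors for a long document (for example, one per paragraph) and combine their scores at retrieval time.
Quantization and compression: Use PQ (Product Quantization) or binary encoding to reduce vector storage and computation costs.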
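
As a rough illustration of binary encoding, the sketch below keeps only the sign of each dimension and compares codes by Hamming distance. Sign thresholding is just one simple scheme chosen here as an assumption, and the four-dimensional vectors are made up; this is not product quantization.

```python
def binarize(vector):
    """Pack the sign of each dimension into an int: bit i is 1 when vector[i] > 0."""
    code = 0
    for i, value in enumerate(vector):
        if value > 0:
            code |= 1 << i
    return code

def hamming_distance(code_a, code_b):
    """Number of differing bits between two binary codes."""
    return bin(code_a ^ code_b).count("1")

# Hypothetical 4-dimensional embeddings
vectors = {
    "doc_a": [0.4, -0.2, 0.7, -0.1],
    "doc_b": [0.5, -0.3, 0.6, 0.2],
    "doc_c": [-0.6, 0.3, -0.5, 0.1],
}
codes = {name: binarize(v) for name, v in vectors.items()}
query_code = binarize([0.3, -0.1, 0.8, -0.4])

# The nearest document is the one whose code differs in the fewest bits
nearest = min(codes, key=lambda name: hamming_distance(query_code, codes[name]))
print(codes, nearest)
```

Each vector shrinks to one bit per dimension, at the cost of discarding magnitude information, so binary codes are often used for a fast first pass before exact rescoring.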

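4. Post-retrieval processing

Deduplication and aggregation: Merge similar or duplicated retrieval results so that redundant information does not interfere with generation.
Hypothetical Document Embeddings (HyDE): Have the LLM generate a "hypothetical answer" and use its vector to retrieve real documents, which improves robustness to vague queries.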
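
Deduplication can be sketched with a simple word-set comparison. Jaccard similarity and the 0.8 threshold are illustrative assumptions; embedding similarity is an equally valid basis for deciding that two results are near-duplicates.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def deduplicate(results, threshold=0.8):
    """Keep a result unless it is too similar to one already kept.

    Results are assumed to be sorted by relevance, so the best copy survives.
    """
    kept = []
    for text in results:
        if all(jaccard(text, seen) < threshold for seen in kept):
            kept.append(text)
    return kept

retrieved = [
    "Python is a popular programming language.",
    "python is a popular programming language.",
    "Faiss performs fast vector similarity search.",
]
unique = deduplicate(retrieved)
print(unique)
```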

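How to optimize the generation stage?

The generation stage must ensure that the output is consistent with the retrieved content and meets the user's needs:

1. Prompt engineering

Structured instructions: Explicitly require that the generated content be based on the retrieved results.

Few-shot learning: Include examples in the prompt to guide the model toward the desired format and logic.
Chain-of-Thought: Ask the model to output its reasoning first and then the final answer, which improves interpretability.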
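
The prompt techniques above can be combined in a small prompt builder. This is a sketch only: the function name, the instruction wording, and the sample passages are assumptions, and the exact phrasing that works best depends on the model.

```python
def build_prompt(question, passages, examples=()):
    """Assemble a grounded prompt: instruction, optional few-shot examples, numbered context, question."""
    lines = [
        "Answer the question using only the numbered context passages below.",
        "Cite passages as [n]. If the context is insufficient, say you do not know.",
        "",
    ]
    # Few-shot examples show the model the expected answer format
    for example_question, example_answer in examples:
        lines += [f"Example question: {example_question}", f"Example answer: {example_answer}", ""]
    lines.append("Context:")
    lines += [f"[{i}] {passage}" for i, passage in enumerate(passages, start=1)]
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)

prompt = build_prompt(
    "What does Faiss do?",
    ["Faiss is a library for vector similarity search.", "BM25 is a keyword ranking function."],
    examples=[("What is BM25?", "BM25 is a keyword ranking function [2].")],
)
print(prompt)
```

Numbering the passages gives the model a handle for citing its sources, which also makes the answer easier to verify afterwards.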

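2. Generation model optimization

Domain fine-tuning: Fine-tune the generation model (such as GPT-3 or Llama 2) on domain data so that it adapts better to domain-specific terminology and style.
Constrained generation: Restrict the output vocabulary with a prefix tree (Trie), or use an API to force the output to reference entities from the retrieved results.
Multi-step generation: Generate a draft first and then revise it against the retrieved content, or decompose a complex question into sub-questions and answer them step by step.

3. Post-processing and verification

Factuality check: Compare the generated output with the retrieved content and detect contradictions with an NLI (natural language inference) model.
Citation tracing: Automatically annotate the generated content with the retrieved document fragments it corresponds to, so that users can verify it.
Toxicity filtering: Use a classification model to filter harmful content out of the generated output.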
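
Citation tracing can be approximated very roughly with word overlap, as sketched below. This is a crude stand-in chosen for illustration, not an NLI model: the function names, the 0.5 overlap threshold, and the sample texts are all assumptions, and the passage list is assumed to be non-empty.

```python
import re

def tokenize(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def attribute_sentences(answer, passages, min_overlap=0.5):
    """Map each answer sentence to the index of the passage sharing most of its words, or None."""
    attributions = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = tokenize(sentence)
        if not words:
            continue
        # Fraction of the sentence's words that also appear in each passage
        scores = [len(words & tokenize(p)) / len(words) for p in passages]
        best = max(range(len(passages)), key=scores.__getitem__)
        attributions.append((sentence, best if scores[best] >= min_overlap else None))
    return attributions

passages = [
    "Faiss is a library for efficient vector similarity search.",
    "Elasticsearch ranks keyword matches with BM25.",
]
# The second sentence is deliberately unsupported by any passage
answer = "Faiss is a library for vector similarity search. It also translates documents."
cited = attribute_sentences(answer, passages)
for sentence, source in cited:
    print(source, sentence)
```

Sentences mapped to None have no supporting passage and are candidates for removal or for a stricter factuality check.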

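How to improve the efficiency and scalability of RAG?

1. Caching and indexing

Result caching: Cache the retrieval and generation results of frequent queries to avoid repeated computation.
Incremental indexing: Update the index in real time as new data arrives instead of rebuilding it from scratch.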

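2. Asynchronous pipeline

Parallel retrieval and generation: While the generation model processes the current results, preload the next batch of retrieved content.
Distributed architecture: Deploy the retriever and the generator as independent services and use load balancing to handle high concurrency.

3. Latency and resource trade-offs

Tiered retrieval: Return coarse-grained results quickly first, then fill in the details asynchronously.
Model distillation: Replace the original large model with a distilled lightweight model (such as TinyBERT) to reduce compute cost.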

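Advanced RAG: Retrieval Techniques

Hybrid search model
So far we have discussed running queries against a vector database, where the embedding vectors are stored. Let's go a step further and combine this with traditional keyword-based search. This ensures the retrieval system can handle a range of query types, from those that need exact keyword matches to more complex ones that require an understanding of context.

Let's build a hybrid search model. We will use Elasticsearch as the traditional search mechanism and Faiss as the vector database for semantic search.

1.1.1. Creating the Elasticsearch index
First, assume that all documents are in the "documents" list and that their embedding vectors have already been computed and stored alongside each document. The following code block connects to Elasticsearch 8.13.4 and creates an index for the given sample documents.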

from elasticsearch import Elasticsearch

ES_NODES = "http://localhost:9200"

documents = [
    {"id": 1, "text": "How to start with Python programming.", "vector": [0.1, 0.2, 0.3]},
    {"id": 2, "text": "Advanced Python programming tips.", "vector": [0.1, 0.3, 0.4]},
    # More documents...
]

# Connect to the Elasticsearch node
es = Elasticsearch(hosts=ES_NODES)

# Index only the text of each document; the vectors are indexed separately in Faiss
for doc in documents:
    es.index(index="documents", id=doc["id"], document={"text": doc["text"]})

# Make the newly indexed documents visible to search immediately
es.indices.refresh(index="documents")

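1.1.2. Creating the Faiss index

In this part we use Faiss as the vector database and index the vectors.

import numpy as np
import faiss

dimension = 3  # Assuming 3D vectors for simplicity
faiss_index = faiss.IndexFlatL2(dimension)
# Faiss expects a float32 array of shape (n_documents, dimension)
vectors = np.array([doc["vector"] for doc in documents], dtype="float32")
faiss_index.add(vectors)

1.1.3. Hybrid search

The code below combines Elasticsearch keyword search with Faiss semantic vector matching into a single hybrid search.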

def hybrid_search(query_text, query_vector, alpha=0.5):
    # Perform a keyword search using Elasticsearch on the "documents" index, matching the provided query_text.
    response = es.search(index="documents", query={"match": {"text": query_text}})
    # Extract the document IDs and their corresponding scores from the Elasticsearch response.
    keyword_results = {hit["_id"]: hit["_score"] for hit in response["hits"]["hits"]}

    # Prepare the query vector for vector search: reshape and cast to float32 for compatibility with Faiss.
    query_vector = np.array(query_vector).reshape(1, -1).astype("float32")
    # Retrieve the indices of the top 5 closest documents (or fewer if the index holds fewer vectors).
    _, indices = faiss_index.search(query_vector, min(5, faiss_index.ntotal))
    # Score each hit by reciprocal rank: the closest document scores 1, the next 1/2, and so on.
    # Faiss pads missing results with -1, so those entries are skipped.
    vector_results = {
        str(documents[idx]["id"]): 1 / (rank + 1)
        for rank, idx in enumerate(indices[0])
        if idx >= 0
    }

    # Initialize a dictionary to hold combined scores from keyword and vector search results.
    combined_scores = {}
    # Iterate over the union of document IDs from both keyword and vector results.
    for doc_id in set(keyword_results) | set(vector_results):
        # Use alpha to balance the influence of the keyword score and the vector score.
        combined_scores[doc_id] = alpha * keyword_results.get(doc_id, 0) + (1 - alpha) * vector_results.get(doc_id, 0)

    # Return the dictionary containing combined scores for all relevant documents.
    return combined_scores

# Example usage
query_text = "Python programming"
query_vector = [0.1, 0.25, 0.35]
# Execute the hybrid search function with the specified query text and vector.
results = hybrid_search(query_text, query_vector)
# Print the results of the hybrid search to see the combined scores of documents.
print(results)

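The hybrid_search function first runs a keyword search with Elasticsearch. Next, it runs a vector search with Faiss, which returns the indices of the five closest documents. These indices are used to give each document a score inversely proportional to its rank (that is, the closest document gets the highest score).

Once we have the results from both Elasticsearch and Faiss, we can combine the scores of the two methods. The final score of each document is a weighted average controlled by the parameter alpha; alpha=0.5 gives both results equal weight.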
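
One practical caveat with a weighted average is that the two score types live on different scales: BM25 scores from Elasticsearch are unbounded, while the reciprocal-rank scores lie between 0 and 1, so the keyword side can dominate regardless of alpha. A possible remedy, sketched here as an assumption rather than as part of the original code, is to min-max normalize each score set before fusing. The function names and the sample scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} dict to the [0, 1] range."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def fuse(keyword_scores, vector_scores, alpha=0.5):
    """Weighted average of the two normalized score sets; missing documents count as 0."""
    keyword_norm = min_max_normalize(keyword_scores)
    vector_norm = min_max_normalize(vector_scores)
    doc_ids = set(keyword_norm) | set(vector_norm)
    return {d: alpha * keyword_norm.get(d, 0.0) + (1 - alpha) * vector_norm.get(d, 0.0) for d in doc_ids}

keyword_scores = {"1": 7.2, "2": 3.6, "3": 1.8}  # hypothetical BM25 scores
vector_scores = {"2": 1.0, "1": 0.5}             # 1 / (rank + 1), as in hybrid_search
fused = fuse(keyword_scores, vector_scores)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking, fused)
```

With the raw scores, document 1 would win on its BM25 score alone (0.5 * 7.2 + 0.5 * 0.5 = 3.85 versus 0.5 * 3.6 + 0.5 * 1.0 = 2.3). After normalization, document 2, which ranks first in the vector search, comes out on top.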

For the complete code, see [2].

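1.2 Fine-tuning the embedding model

Fine-tuning the embedding model is an effective way to improve the performance of a retrieval-augmented generation system. Fine-tuning a pretrained model helps it understand the nuances of a specific domain or dataset, which can significantly improve the relevance and accuracy of the retrieved documents.

The importance of fine-tuning can be summarized in the following points:

Better semantic understanding: Fine-tuning helps the model grasp domain-specific terms and concepts that may be poorly represented in the original training data.
Adapting to changing content: Information in some fields (such as medicine or technology) changes rapidly, and keeping the embeddings up to date through fine-tuning keeps the system effective.
Higher retrieval precision: By aligning the embedding space more closely with the target use case, fine-tuning makes retrieval of semantically relevant text more reliable.

1.2.1 Preparing the fine-tuning data

The following code block is the first step of fine-tuning. It sets up the pipeline for fine-tuning a pretrained masked language model: it loads the model and tokenizer and selects the device (GPU or CPU).

After initialization, it processes a sample dataset through tokenization and dynamic token masking. This prepares the model for self-supervised learning, in which it predicts the masked tokens and thereby improves its semantic understanding of the input data.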

import torch
from datasets import Dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

# Define the model name using a pre-trained model from the Sentence Transformers library
model_name = "sentence-transformers/all-MiniLM-L6-v2"

# Load the tokenizer for the specified model from Hugging Face's transformers library
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model for masked language modeling based on the specified model
# Note: this checkpoint was published as a sentence encoder, so transformers may warn
# that the masked-LM head is newly initialized
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Determine if a GPU is available and set the device accordingly; use CPU if GPU is not available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the appropriate device (GPU or CPU)
model.to(device)

# Define a generator function to create a dataset; this should be replaced with actual data loading logic
def dataset_generator():
    # Example dataset composed of individual sentences; replace with your actual dataset sentences
    dataset = ["sentence1", "sentence2", "sentence3"]
    # Yield each sentence as a dictionary with the key 'text'
    for sentence in dataset:
        yield {"text": sentence}

# Create a dataset object using Hugging Face's Dataset class from the generator function
dataset = Dataset.from_generator(dataset_generator)

# Define a function to tokenize the text data
def tokenize_function(example):
    # Tokenize the input text and truncate it to the maximum length the model can handle
    return tokenizer(example["text"], truncation=True)

# Apply the tokenization function to all items in the dataset, batch processing them for efficiency
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Initialize a data collator for masked language modeling which randomly masks tokens
# This is used for training the model in a self-supervised manner
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

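1.2.2 Fine-tuning the model

Once the data is ready, we can start the fine-tuning phase, in which we take the model's existing weights and begin updating them.

The following code block uses Hugging Face's Trainer API to set up and run the training of the language model. It first defines the training arguments (number of epochs, batch size, learning rate, and so on). The Trainer object is then created from these settings together with the model, the tokenized dataset, and the data collator for masked language modeling, all of which were created in the previous step. When training finishes, the updated model and its tokenizer are saved for use in the next step.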

from transformers import Trainer, TrainingArguments

# Define training arguments to configure the training session
training_args = TrainingArguments(
    output_dir="output",             # Directory where the outputs (like checkpoints) will be saved
    num_train_epochs=3,              # Total number of training epochs to perform
    per_device_train_batch_size=16,  # Batch size per device during training
    learning_rate=2e-5,              # Learning rate for the optimizer
)

# Initialize the Trainer, which handles the training loop and evaluation
trainer = Trainer(
    model=model,                       # The model to be trained, already loaded and configured
    args=training_args,                # The training arguments defining the training setup
    train_dataset=tokenized_datasets,  # The dataset to train on, already tokenized and prepared
    data_collator=data_collator,       # The data collator that handles input formatting and masking
)

# Start the training process
trainer.train()

# Define the paths where the fine-tuned model and tokenizer will be saved
model_path = "./model"
tokenizer_path = "./tokenizer"

# Save the fine-tuned model to the specified path
model.save_pretrained(model_path)

# Save the tokenizer used in training to the specified path
tokenizer.save_pretrained(tokenizer_path)

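1.2.3 Using the fine-tuned model

Now it is time to use the saved model and tokenizer to generate embedding vectors.

The following code block loads the model and tokenizer and generates embeddings for the given sentences. First, the model and tokenizer are loaded from the saved paths and the model is placed on the GPU or CPU. The sentences (in the context of this article, these are queries) are tokenized. The model processes these inputs without updating its parameters; this is inference mode, enabled with with torch.no_grad(). We do not use the model to predict tokens here; instead, the goal is to extract embedding vectors from the model's hidden states. As a last step, the embedding vectors are moved back to the CPU.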

# Continues from the previous steps: torch, Dataset, AutoTokenizer, AutoModelForMaskedLM,
# device, model_path and tokenizer_path are already defined.

# Load the tokenizer and model from saved paths, ensuring the model is allocated to the appropriate device (GPU or CPU)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
model = AutoModelForMaskedLM.from_pretrained(model_path).to(device)

# Define a function to tokenize input sentences, configuring padding and truncation to handle variable sentence lengths
def tokenize_function_embedding(example):
    return tokenizer(example["text"], padding=True, truncation=True)

# List of example sentences to generate embeddings for
sentences = ["This is the first sentence.", "This is the second sentence."]

# Create a Dataset object directly from these sentences
dataset_embedding = Dataset.from_dict({"text": sentences})

# Apply the tokenization function to the dataset, preparing it for embedding generation
# batch_size=None processes the whole dataset as one batch, so all rows are padded to the same length
tokenized_dataset_embedding = dataset_embedding.map(tokenize_function_embedding, batched=True, batch_size=None)

# Extract 'input_ids' and 'attention_mask' needed for the model to understand which parts of the input are padding and which are actual content
input_ids = tokenized_dataset_embedding["input_ids"]
attention_mask = tokenized_dataset_embedding["attention_mask"]

# Convert these lists into tensors and ensure they are on the correct device (GPU or CPU) for processing
input_ids = torch.tensor(input_ids).to(device)
attention_mask = torch.tensor(attention_mask).to(device)

# Generate embeddings using the model without updating gradients to save computational resources
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True)
    # Extract the last layer's hidden states as embeddings, specifically the first token
    # (typically used in BERT-type models for representing sentence embeddings;
    # mean pooling over all tokens is a common alternative)
    embeddings = outputs.hidden_states[-1][:, 0, :]

# Move the embeddings from the GPU back to CPU for easy manipulation or saving
embeddings = embeddings.cpu()

# Print each sentence with its corresponding embedding vector
for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}\n")

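Advanced RAG: Post-Retrieval Processing

Once the relevant information has been retrieved, it still needs to be fed to the large model in the right order. In the next two subsections we explain how summarization and re-ranking can improve the quality of RAG.

1.1 Summarizing the responses
This step may be necessary if large text chunks were stored in the database during indexing. If the chunks are already small, it may not be needed.

The following code block performs the summarization. It uses the transformers library with a pretrained BART model to produce a summary of the text. The summarize_text function takes a text and uses the model to generate a concise summary within the defined maximum and minimum length parameters.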

from transformers import pipeline

def summarize_text(text, max_length=130):
    # Load a pre-trained summarization model from Hugging Face's model hub.
    # 'facebook/bart-large-cnn' is chosen for its proficiency in generating concise summaries.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    # The summarizer uses the BART model to condense the input text into a summary.
    # 'max_length' specifies the maximum length of the summary output.
    # 'min_length' sets the minimum length to ensure the summary is not too terse.
    # 'do_sample' is set to False to use a deterministic approach for summary generation.
    summary = summarizer(text, max_length=max_length, min_length=30, do_sample=False)

    # The output from the summarizer is a list of dictionaries.
    # We extract the summary text from the first dictionary in the list.
    return summary[0]['summary_text']

# Example text to be summarized.
# This text discusses the importance of summarization in retrieval-augmented generation systems.
long_text = "Summarization are vital steps in the workflow of retrieval-augmented generation systems. They ensure the output is not only accurate but also concise and digestible. These techniques are essential, especially in domains where the accuracy and precision of information are crucial."

# Call the summarize_text function to compress the example text.
summarized_text = summarize_text(long_text)

# Print the summarized text to see the output of the summarization model.
print("Summarized Text:", summarized_text)

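For the complete code, see [3].

1.2 Re-ranking and filtering
During retrieval you should already have obtained a score for each document, which is the similarity between its vector and the query vector. This information can be used to re-rank the documents and to filter the results against a given threshold. The following code block shows an example of re-ranking and filtering.

1.2.1. Basic re-ranking and filtering

The code block below defines a list of documents, each represented by a dictionary containing an ID, a text, and a relevance score. It then implements two main functions, re_rank_documents and filter_documents. The re_rank_documents function sorts the documents by relevance score in descending order; after re-ranking, the filter_documents function excludes any document whose relevance score is below the specified threshold of 0.75.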

# Define a list of documents. Each document is represented as a dictionary with an ID, text, and a relevance score.
documents = [
    {"id": 1, "text": "Advanced RAG systems use sophisticated techniques for text summarization.", "relevance_score": 0.82},
    {"id": 2, "text": "Basic RAG systems primarily focus on retrieval and basic processing.", "relevance_score": 0.55},
    {"id": 3, "text": "Re-ranking improves the quality of responses by ordering documents by relevance.", "relevance_score": 0.89}
]

# Define a function to re-rank documents based on their relevance scores.
def re_rank_documents(docs):
    # Use the sorted function to order the documents by 'relevance_score'.
    # The key for sorting is specified using a lambda function, which extracts the relevance score from each document.
    # 'reverse=True' sorts the list in descending order, placing documents with higher relevance scores first.
    return sorted(docs, key=lambda x: x['relevance_score'], reverse=True)

# Re-rank the documents using the defined function and print the result.
ranked_documents = re_rank_documents(documents)
print("Re-ranked Documents:", ranked_documents)

# Define a function to filter documents based on a relevance score threshold.
def filter_documents(docs, relevance_threshold=0.75):
    # Use a list comprehension to create a new list that includes only those documents whose 'relevance_score'
    # is greater than or equal to the 'relevance_threshold'.
    return [doc for doc in docs if doc['relevance_score'] >= relevance_threshold]

# Filter the re-ranked documents using the defined function with a threshold of 0.75 and print the result.
filtered_documents = filter_documents(ranked_documents)
print("Filtered Documents:", filtered_documents)

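1.2.2. Advanced re-ranking with machine learning

For a more sophisticated approach, a machine learning model can be used to re-rank the documents. The challenge here is: how do we know which documents are relevant, so that we can train a model to rank them?

For this approach we need to assume a system that records the interactions between users and the system, including whether a document was relevant to a given query. Once we have such a dataset, we can use the query embeddings and document embeddings to predict a score.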

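import numpy as np
from sklearn.linear_model import LogisticRegression

# Assuming the interaction data is stored in the following format in a database:
# query_text | response_text | user_clicked
# `database` and `get_embedding_vector` are placeholders for your own log store
# and embedding function; each call is assumed to return an array of shape (n, d).
query_embeddings = get_embedding_vector(database.query_text)
response_embeddings = get_embedding_vector(database.response_text)

# Create the dataset: one feature vector per (query, response) pair
X = np.concatenate([query_embeddings, response_embeddings], axis=1)
y = np.asarray(database.user_clicked)

# Train a classifier on the click labels
model = LogisticRegression(max_iter=1000).fit(X, y)

# X_new (placeholder): features for new query-document pairs, built the same way as X.
# Column 1 of predict_proba is the predicted click probability, used as the re-ranking score.
click_probability = model.predict_proba(X_new)[:, 1]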

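The code sketch above outlines how machine learning can be used to re-rank documents by relevance, specifically by predicting, from past interactions, how likely a user is to find a document relevant. The steps of this process are:

Generating embeddings: For both the queries and the response documents, create embedding vectors that capture their semantic content.
Creating the dataset: The embeddings are concatenated to form the feature vectors (X), and the target variable (y) indicates whether the user clicked the document.
Model training: A classification model is trained on this dataset to predict, from the combined query and document embeddings, how likely a document is to be clicked.
Prediction: The trained model predicts the click probability for new query-document pairs, which is used to re-rank the documents by predicted relevance and improve the accuracy of the search results.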