MMRAG-DocQA Multimodal Retrieval-Augmented Generation: Source Code Implementation Analysis

I. Overview

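This article walks through the source code of the MMRAG-DocQA paper (arXiv:2508.00579). The code can be downloaded from https://github.com/Gzy1112/MMRAG-DocQA. MMRAG-DocQA is a multimodal retrieval-augmented generation (RAG) system for question answering over long, complex documents. Its approach rests on two mechanisms: hierarchical indexing and multi-granularity retrieval.
The system combines hierarchical indexing, multi-granularity retrieval, and recursive abstractive processing to answer questions that require understanding a whole document. Reading the source closely helps in understanding the paper's algorithms and in adapting them to your own RAG applications.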

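II. System Architecture

III. Data Flow

1. Data Processing Flowchart

2. Document Processing in Detail

2.1 Stage 1: PDF Parsing

# Input: the original PDF document
# Processing: structured parsing with Docling
# Output: a JSON structure containing pages, tables, and images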
{
    "metainfo": {
        "sha1_name": "document_hash",
        "company_name": "Company Name"
    },
    "content": {
        "pages": [
            {
                "page": 1,
                "text": "page text content",
                "images": [...],
                "tables": [...]
            }
        ],
        "tables": [...]
    }
}

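2.2 Stage 2: Document Merging

# Input: the complex Docling output structure
# Processing: simplify the structure and merge text blocks
# Output: a simplified page-level JSON structure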
{
    "metainfo": {...},
    "content": {
        "pages": [
            {
                "page": 1,
                "text": "merged page text"
            }
        ]
    }
}

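2.3 Stage 3: Text Chunking

# Input: the page-level JSON structure
# Processing: recursive character splitting that produces overlapping chunks
# Output: the chunked document structure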
{
    "metainfo": {...},
    "content": {
        "pages": [...],
        "chunks": [
            {
                "id": 0,
                "type": "content",
                "page": 1,
                "text": "chunk text content",
                "length_tokens": 150
            },
            {
                "id": 1,
                "type": "serialized_table",
                "page": 1,
                "text": "table content",
                "table_id": "table_1"
            }
        ]
    }
}

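2.4 Stage 4: Index Construction

# Vector database: FAISS index files
# BM25 database: pickle-serialized files
# RAPTOR index: hierarchical summary tree

3. Retrieval-Augmented Generation Flow

3.1 Query Processing

Question preprocessing: detect comparative questions and decompose them into sub-questions
Step-back query: generate a more abstract query to improve retrieval
Retrieval strategy selection: choose vector, BM25, or hybrid retrieval based on the configuration

3.2 Multi-Granularity Retrieval

# Page-level retrieval: fine-grained content retrieval
# Document-level retrieval: document summaries obtained through RAPTOR
# Parent-document retrieval: the surrounding context of the relevant pages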

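3.3 Result Refinement

Result fusion: merge the results of the different retrieval strategies
LLM reranking: reorder the retrieved results with a large language model
Context formatting: build a structured context for the LLM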
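
The excerpts in this article do not show how results from different retrieval strategies are merged. As a minimal sketch, assuming reciprocal rank fusion (RRF) as a stand-in technique rather than the repository's actual method, two ranked lists of chunk ids could be combined as follows. The function name `rrf_fuse`, the constant `k=60`, and the chunk ids are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[int]], k: int = 60) -> List[int]:
    """Fuse several ranked lists of chunk ids with reciprocal rank fusion.

    Each id receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); ids are returned by descending fused score.
    """
    scores: Dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score (descending), breaking ties by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = [3, 1, 7]   # hypothetical chunk ids from vector search
bm25_hits = [1, 9, 3]     # hypothetical chunk ids from BM25 search
fused = rrf_fuse([vector_hits, bm25_hits])
```

Chunks that appear in both lists (here 1 and 3) accumulate two contributions and therefore rank ahead of chunks found by only one strategy.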

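3.4 Answer Generation

Prompt construction: build a dedicated prompt for each question type
LLM inference: generate the answer with the configured model
Result validation: verify that the cited page references are accurate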
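
The validation step can be pictured as filtering the pages cited by the model against the pages that were actually retrieved. The sketch below is an assumption about what such a check might look like; the function name `validate_page_references` and the record layout are hypothetical and not taken from the repository.

```python
from typing import Dict, List


def validate_page_references(cited_pages: List[int],
                             retrieved: List[Dict]) -> List[int]:
    """Keep only cited pages that were part of the retrieved context.

    Duplicates are dropped and the original citation order is preserved.
    """
    available = {item["page"] for item in retrieved}
    seen = set()
    valid = []
    for page in cited_pages:
        if page in available and page not in seen:
            seen.add(page)
            valid.append(page)
    return valid


# Hypothetical retrieval results and model citations.
retrieved = [{"page": 4, "text": "..."}, {"page": 9, "text": "..."}]
validated = validate_page_references([9, 12, 4, 9], retrieved)
```

Page 12 is discarded because it never appeared in the context, which guards against citations the model could not have grounded.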

IV. Core Component Implementation Analysis

1. Pipeline Configuration Management

1.1 The PipelineConfig Class

# pipeline.py:20-46
@dataclass
class PipelineConfig:
    def __init__(self, root_path: Path, subset_name: str = "", 
                 questions_file_name: str = "questions.json", 
                 pdf_reports_dir_name: str = "pdf_reports", 
                 serialized: bool = False, config_suffix: str = ""):
        
        self.root_path = root_path
        suffix = "_ser_tab" if serialized else ""
        
        # Path configuration
        self.subset_path = root_path / subset_name
        self.questions_file_path = root_path / questions_file_name
        self.pdf_reports_dir = root_path / pdf_reports_dir_name
        
        # Output paths
        self.answers_file_path = root_path / f"answers{config_suffix}.json"
        self.debug_data_path = root_path / "debug_data"
        self.databases_path = root_path / f"databases{suffix}"
        
        # Database paths
        self.vector_db_dir = self.databases_path / "vector_dbs"
        self.documents_dir = self.databases_path / "chunked_reports"
        self.bm25_db_path = self.databases_path / "bm25_dbs"

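Design notes:

Decorated with @dataclass, although all attributes are assigned in a hand-written __init__
The serialized flag appends the "_ser_tab" suffix to the databases directory, keeping serialized-table runs separate
All file paths are managed in one place, which eases maintenance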
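
To make the path layout concrete, the sketch below reproduces only the path derivation from the excerpt as a standalone function. The name `derive_paths` and the returned dictionary are illustrative assumptions; the directory and file names are copied from the excerpt.

```python
from pathlib import Path


def derive_paths(root_path: Path, serialized: bool = False,
                 config_suffix: str = "") -> dict:
    """Mirror the path layout from the PipelineConfig excerpt."""
    suffix = "_ser_tab" if serialized else ""
    databases_path = root_path / f"databases{suffix}"
    return {
        "answers": root_path / f"answers{config_suffix}.json",
        "vector_dbs": databases_path / "vector_dbs",
        "chunked_reports": databases_path / "chunked_reports",
        "bm25_dbs": databases_path / "bm25_dbs",
    }


plain = derive_paths(Path("data"))
with_tables = derive_paths(Path("data"), serialized=True, config_suffix="_v2")
```

With `serialized=True`, every index lands under `databases_ser_tab`, so the two variants of a corpus can coexist under the same root.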

1.2 The RunConfig Class

# pipeline.py:48-65
@dataclass
class RunConfig:
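    use_serialized_tables: bool = False           # whether to use serialized tables
    parent_document_retrieval: bool = False       # whether to enable parent-document retrieval
    use_vector_dbs: bool = True                   # whether to use the vector databases
    use_bm25_db: bool = False                     # whether to use the BM25 database
    llm_reranking: bool = False                   # whether to enable LLM reranking
    llm_reranking_sample_size: int = 30          # number of candidates passed to reranking
    top_n_retrieval: int = 10                     # number of retrieval results
    parallel_requests: int = 10                   # number of parallel requests
    api_provider: str = "qwen"                    # API provider
    answering_model: str = "qwen-turbo"          # answering model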

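Design notes:

Exposes switches for serialized tables, parent-document retrieval, the vector and BM25 databases, and LLM reranking
The defaults enable vector retrieval only, so the pipeline runs out of the box
The API provider and answering model are configurable (defaults: "qwen" and "qwen-turbo")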
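
Because RunConfig is a plain dataclass, experiment variants can be derived from a base configuration without mutating it. The sketch below redeclares a subset of the fields with the defaults shown in the excerpt; using `dataclasses.replace` this way is a suggestion, not something the repository is known to do.

```python
from dataclasses import dataclass, replace


@dataclass
class RunConfig:
    # A subset of the fields from the RunConfig excerpt, with the same defaults.
    use_vector_dbs: bool = True
    use_bm25_db: bool = False
    llm_reranking: bool = False
    llm_reranking_sample_size: int = 30
    top_n_retrieval: int = 10


base = RunConfig()
# replace() returns a new instance; the base configuration is unchanged.
reranked = replace(base, llm_reranking=True, top_n_retrieval=5)
```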

2. PDF Parser

2.1 Core Parsing Logic

# pdf_parsing.py:40-60
class PDFParser:
    def __init__(self, pdf_backend=DoclingParseV2DocumentBackend,
                 output_dir: Path = Path("./parsed_pdfs"),
                 num_threads: int = None,
                 csv_metadata_path: Path = None):
        
        self.pdf_backend = pdf_backend
        self.output_dir = output_dir
        self.doc_converter = self._create_document_converter()
        self.num_threads = num_threads
        self.metadata_lookup = {}
        
        if csv_metadata_path is not None:
            self.metadata_lookup = self._parse_csv_metadata(csv_metadata_path)

2.2 Document Converter Configuration

# pdf_parsing.py:77-100
def _create_document_converter(self) -> "DocumentConverter":
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.document_converter import DocumentConverter, FormatOption
    from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
    
    pipeline_options = PdfPipelineOptions(artifacts_path="")
    pipeline_options.do_ocr = True                    # enable OCR
    pipeline_options.do_table_structure = True        # enable table structure recognition
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
    pipeline_options.generate_page_images = True      # render page images
    pipeline_options.generate_picture_images = True   # extract embedded pictures as images
    
    format_options = {
        InputFormat.PDF: FormatOption(
            pipeline_cls=StandardPdfPipeline,
            pipeline_options=pipeline_options,
            backend=self.pdf_backend  # FormatOption requires a backend; use the one set in __init__
        )
    }
    
    return DocumentConverter(format_options=format_options)

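Technical features:

Uses Docling's DoclingParseV2DocumentBackend as the default parsing backend
Enables OCR, table structure recognition, and page and picture image generation
Configurable number of threads
Supports attaching metadata loaded from a CSV file

2.3 Parallel Processing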

# pdf_parsing.py:26-38
def _process_chunk(pdf_paths, pdf_backend, output_dir, 
                  num_threads, metadata_lookup, debug_data_path):
    """Helper that processes one chunk of PDFs in a worker process."""
    parser = PDFParser(
        pdf_backend=pdf_backend,
        output_dir=output_dir,
        num_threads=num_threads,
        csv_metadata_path=None
    )
    parser.metadata_lookup = metadata_lookup
    parser.debug_data_path = debug_data_path
    parser.parse_and_export(pdf_paths)
    return f"Processed {len(pdf_paths)} PDFs."

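Design advantages:

Supports multi-process parallelism to speed up parsing
Each process creates its own parser instance
Supports batch processing and progress tracking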
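
The excerpt shows only the per-chunk worker, not how the PDF list is divided. The sketch below illustrates one plausible dispatch scheme under stated assumptions: `split_into_chunks`, `process_chunk`, and `run_parallel` are hypothetical names, and a thread pool is used so the example runs anywhere. Passing `ProcessPoolExecutor` as `executor_cls` would give process-based parallelism, provided the worker is defined at module level.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_chunks(paths: List[str], num_workers: int) -> List[List[str]]:
    """Split paths into at most num_workers contiguous, near-equal chunks."""
    if not paths:
        return []
    chunk_size = -(-len(paths) // num_workers)  # ceiling division
    return [paths[i:i + chunk_size] for i in range(0, len(paths), chunk_size)]


def process_chunk(pdf_paths: List[str]) -> str:
    """Placeholder worker; a real one would build a parser and parse the files."""
    return f"Processed {len(pdf_paths)} PDFs."


def run_parallel(paths: List[str], num_workers: int,
                 worker: Callable = process_chunk,
                 executor_cls=ThreadPoolExecutor) -> List[str]:
    """Dispatch one chunk per task and collect the status messages in order."""
    chunks = split_into_chunks(paths, num_workers)
    with executor_cls(max_workers=num_workers) as executor:
        return list(executor.map(worker, chunks))


messages = run_parallel(["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"], num_workers=2)
```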

3. Text Splitter

3.1 Chunking Strategy

# text_splitter.py:9-15
class TextSplitter():
    def _split_page(self, page: Dict[str, any], 
                   chunk_size: int = 300, 
                   chunk_overlap: int = 50) -> List[Dict[str, any]]:
        """Split the page text with a recursive character splitter."""
        text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            model_name="gpt-4o",
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
        cleaned_text = page['text'].replace("", "") 
        chunks = text_splitter.split_text(cleaned_text)

3.2 Serialized Table Handling

# text_splitter.py:10-33
def _get_serialized_tables_by_page(self, tables: List[Dict]) -> Dict[int, List[Dict]]:
    """Group serialized tables by page."""
    tables_by_page = {}
    for table in tables:
        if 'serialized' not in table:
            continue
            
        page = table['page']
        if page not in tables_by_page:
            tables_by_page[page] = []
        
        table_text = "\n".join(
            block["information_block"] 
            for block in table["serialized"]["information_blocks"]
        )
        
        tables_by_page[page].append({
            "page": page,
            "text": table_text,
            "table_id": table["table_id"],
            "length_tokens": self.count_tokens(table_text)
        })
    
    return tables_by_page

3.3 RAPTOR Integration

# text_splitter.py:46-54
def _split_report(self, file_content: Dict[str, any], 
                 serialized_tables_report_path: Optional[Path] = None):
    """Split the report and build its RAPTOR summaries."""
    chunks = []
    chunk_id = 0
    
    # Create a RAPTOR instance and add the document text
    RA = RetrievalAugmentation()
    doc_text = ''
    for page in file_content['content']['pages']:
        cleaned_text = page['text'].replace("", "")
        doc_text = doc_text + cleaned_text
    RA.add_documents(doc_text)
    
    # Save the RAPTOR index
    SAVE_PATH = os.path.join(SAVE_PATH, file_content["metainfo"]["sha1_name"])
    RA.save(SAVE_PATH)

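Technical features:

Uses LangChain's recursive character splitter (RecursiveCharacterTextSplitter)
Handles serialized tables separately
Integrates RAPTOR for hierarchical summarization
Supports token counting and metadata management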
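
The effect of chunk_size and chunk_overlap is easier to see on a simplified model. The sketch below slides a fixed window over a token list; it is a deliberate simplification and not how RecursiveCharacterTextSplitter works internally, since the real splitter also splits recursively on separators and measures length with tiktoken. The function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int = 300,
                       chunk_overlap: int = 50) -> List[List[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
```

Each chunk repeats the last token of its predecessor, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.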

4. Vector Database Builder

4.1 Embedding

# ingestion.py:65-88
@retry(wait=wait_fixed(20), stop=stop_after_attempt(2))
def _get_embeddings(self, text: Union[str, List[str]], 
                   model: str = "text-embedding-v4") -> List[float]:
    """Get embedding vectors for the text; supports batched input."""
    if isinstance(text, str) and not text.strip():
        raise ValueError("Input text cannot be an empty string.")
    
    if isinstance(text, list):
        text_chunks = [text[i:i + 1024] for i in range(0, len(text), 1024)]
    else:
        text_chunks = [text]
    
    embeddings = []
    batch_size = 10  # number of inputs per embedding request
    
    for chunk in text_chunks:
        for start_idx in range(0, len(chunk), batch_size):
            end_idx = min(start_idx + batch_size, len(chunk))
            batch = chunk[start_idx:end_idx]
            response = self.llm.embeddings.create(input=batch, model=model)
            embeddings.extend([embedding.embedding for embedding in response.data])
    
    return embeddings

4.2 FAISS Index Creation

# ingestion.py:90-95
def _create_vector_db(self, embeddings: List[float]):
    """Create the FAISS vector database."""
    embeddings_array = np.array(embeddings, dtype=np.float32)
    dimension = len(embeddings[0])
    index = faiss.IndexFlatIP(dimension)  # inner-product similarity (equals cosine similarity only if the vectors are L2-normalized)
    index.add(embeddings_array)
    return index

4.3 BM25 Index Creation

# ingestion.py:20-23
def create_bm25_index(self, chunks: List[str]) -> BM25Okapi:
    """Create the BM25 index."""
    tokenized_chunks = [chunk.split() for chunk in chunks]
    return BM25Okapi(tokenized_chunks)

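Technical features:

Uses FAISS for efficient vector retrieval
Supports BM25 for keyword retrieval
Retries failed API calls (up to 2 attempts, waiting 20 seconds between them)
Sends embedding requests in batches of 10 inputs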
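
An inner-product index such as IndexFlatIP ranks by raw dot product, which matches cosine similarity only when the vectors have unit length. The pure-Python sketch below illustrates the difference on toy vectors; it uses exhaustive search in place of FAISS, and the helper names and data are illustrative assumptions.

```python
import math
from typing import List, Tuple


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product_search(query: List[float], vectors: List[List[float]],
                         top_k: int = 2) -> List[Tuple[int, float]]:
    """Exhaustive inner-product search returning (index, score), best first."""
    scores = [(i, sum(q * v for q, v in zip(query, vec)))
              for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda item: -item[1])[:top_k]


raw = [[10.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
query = [1.0, 1.0]

# Without normalization, the longest vector wins regardless of direction.
unnormalized = inner_product_search(query, raw)
# After normalization, the vector pointing the same way as the query wins.
normalized = inner_product_search(l2_normalize(query),
                                  [l2_normalize(v) for v in raw])
```

Whether normalization is needed in practice depends on whether the embedding model already returns unit-length vectors, which the excerpt does not state.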

5. Multi-Granularity Retriever

5.1 Vector Retriever

# retrieval.py:103-117
class VectorRetriever:
    def __init__(self, vector_db_dir: Path, documents_dir: Path):
        self.vector_db_dir = vector_db_dir
        self.documents_dir = documents_dir
        self.all_dbs = self._load_dbs()
        self.llm = self._set_up_llm()
    
    def _load_dbs(self):
        """Load all vector databases and their documents."""
        all_dbs = []
        all_documents_paths = list(self.documents_dir.glob('*.json'))
        vector_db_files = {db_path.stem: db_path for db_path in self.vector_db_dir.glob('*.faiss')}
        
        for document_path in all_documents_paths:
            stem = document_path.stem
            if stem not in vector_db_files:
                continue
                
            # Load the document and its vector database
            with open(document_path, 'r', encoding='utf-8') as f:
                document = json.load(f)
            
            vector_db = faiss.read_index(str(vector_db_files[stem]))
            
            all_dbs.append({
                "name": stem,
                "vector_db": vector_db,
                "document": document
            })

5.2 Hybrid Retriever

# retrieval.py:250-280
class HybridRetriever(VectorRetriever):
    def __init__(self, vector_db_dir: Path, documents_dir: Path, bm25_db_path: Path):
        super().__init__(vector_db_dir, documents_dir)
        self.bm25_db_path = bm25_db_path
        self.bm25_dbs = self._load_bm25_dbs()
    
    def hybrid_search(self, query: str, document_name: str, top_k: int = 10):
        """Combine vector retrieval with BM25 retrieval."""
        # Vector retrieval
        vector_results = self.vector_search(query, document_name, top_k)
        
        # BM25 retrieval
        bm25_results = self.bm25_search(query, document_name, top_k)
        
        # Result fusion
        fused_results = self._fusion_results(vector_results, bm25_results)
        
        # LLM reranking
        if self.llm_reranking:
            fused_results = self.llm_reranker.rerank_documents(query, fused_results)
        
        return fused_results

5.3 Step-Back Query Handling

# retrieval.py:48-100
def get_step_back_query(query: str) -> str:
    """Generate a step-back query to improve retrieval."""
    system_prompt = '''you must return with follow format: {"schema": {"reasoning_summary": "string (concise 1-2 sentence summary)","step_back_answer": "string (direct answer to the step-back question)"}'''
    
    llm = OpenAI(api_key="", base_url="")
    
    completion = llm.chat.completions.create(
        model="qwen-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What is a step-back question of this question: " + query},
        ],
        extra_body={"enable_thinking": True},  # enable thinking (reasoning) mode
        stream=True,
    )
    
    # Consume the streamed response
    reasoning_content = ""
    answer_content = ""
    is_answering = False
    
    for chunk in completion:
        delta = chunk.choices[0].delta
        
        if hasattr(delta, "reasoning_content") and delta.reasoning_content is not None:
            reasoning_content += delta.reasoning_content
        
        if hasattr(delta, "content") and delta.content:
            if not is_answering:
                is_answering = True
            answer_content += delta.content
    
    return answer_content

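Technical features:

Combines vector retrieval with BM25 retrieval
Uses step-back queries to improve retrieval quality
Supports LLM reranking of the retrieved results
Parent-document retrieval supplies richer context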
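
Parent-document retrieval is not shown in the excerpts. A minimal sketch, assuming that each chunk records the page it came from (as in the chunk structure of Stage 3), is to swap every chunk hit for its full page and drop repeated pages. The function name `parent_pages_for_hits` and the `distance` field are assumptions made for illustration.

```python
from typing import Dict, List


def parent_pages_for_hits(hits: List[Dict], pages: List[Dict]) -> List[Dict]:
    """Replace chunk-level hits with their full parent pages, without duplicates.

    Each page keeps the position and score of the best-ranked chunk pointing to it.
    """
    pages_by_number = {page["page"]: page for page in pages}
    seen = set()
    results = []
    for hit in hits:  # hits are assumed to be sorted best-first
        page_number = hit["page"]
        if page_number in seen or page_number not in pages_by_number:
            continue
        seen.add(page_number)
        results.append({
            "page": page_number,
            "text": pages_by_number[page_number]["text"],
            "distance": hit["distance"],
        })
    return results


pages = [{"page": 1, "text": "full page one"}, {"page": 2, "text": "full page two"}]
hits = [
    {"page": 2, "text": "chunk b", "distance": 0.91},
    {"page": 1, "text": "chunk a", "distance": 0.85},
    {"page": 2, "text": "chunk c", "distance": 0.80},
]
parents = parent_pages_for_hits(hits, pages)
```

The trade-off is context length: whole pages give the LLM more surrounding text than 300-token chunks, at the cost of a larger prompt.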

6. LLM Reranker

6.1 Single-Block Reranking

# reranking.py:27-71
def get_rank_for_single_block(self, query, retrieved_document):
    """Score a single retrieved block for reranking."""
    user_prompt = f'\nHere is the query:\n"{query}"\n\nHere is the retrieved text block:\n"""\n{retrieved_document}\n"""\n'
    
    completion = self.llm.chat.completions.create(
        model="qwen-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": self.system_prompt_rerank_single_block},
            {"role": "user", "content": user_prompt},
        ],
        extra_body={"enable_thinking": True},
        stream=True,
    )
    
    # Consume the streamed response
    reasoning_content = ""
    answer_content = ""
    is_answering = False
    
    for chunk in completion:
        delta = chunk.choices[0].delta
        
        if hasattr(delta, "reasoning_content") and delta.reasoning_content is not None:
            reasoning_content += delta.reasoning_content
        
        if hasattr(delta, "content") and delta.content:
            if not is_answering:
                is_answering = True
            answer_content += delta.content
    
    return answer_content

6.2 Batch Reranking

# reranking.py:73-100
def rerank_documents(self, query: str, documents: list, 
                   documents_batch_size: int = 1, 
                   llm_weight: float = 0.7):
    """Rerank multiple documents in batches."""
    # Build the document batches
    doc_batches = [documents[i:i + documents_batch_size] 
                  for i in range(0, len(documents), documents_batch_size)]
    vector_weight = 1 - llm_weight
    
    # Process the batches in parallel with a thread pool
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = []
        for batch in doc_batches:
            future = executor.submit(
                self._process_batch, query, batch, vector_weight, llm_weight
            )
            futures.append(future)
        
        # Collect the results
        all_results = []
        for future in futures:
            batch_results = future.result()
            all_results.extend(batch_results)
    
    return all_results

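Technical features:

Supports reranking single blocks and batches of documents
Uses a thread pool (5 workers) to issue LLM calls in parallel
Combines vector similarity with the LLM relevance score
The balance is configurable through llm_weight (default 0.7; the vector weight is 1 - llm_weight)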
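
The excerpt passes vector_weight and llm_weight to `_process_batch` but does not show how they are applied. A natural reading is a weighted sum of the two scores, sketched below. The field names `relevance_score`, `distance`, and `combined_score`, the rounding, and the final sort are assumptions for illustration.

```python
from typing import Dict, List


def combine_scores(documents: List[Dict], llm_weight: float = 0.7) -> List[Dict]:
    """Blend LLM relevance and vector similarity, then sort best-first."""
    vector_weight = 1 - llm_weight
    combined = []
    for doc in documents:
        score = llm_weight * doc["relevance_score"] + vector_weight * doc["distance"]
        combined.append({**doc, "combined_score": round(score, 4)})
    return sorted(combined, key=lambda d: d["combined_score"], reverse=True)


# Hypothetical candidates: page 3 is closer in embedding space,
# but the LLM judges page 8 to be far more relevant.
docs = [
    {"page": 3, "distance": 0.90, "relevance_score": 0.2},
    {"page": 8, "distance": 0.60, "relevance_score": 0.9},
]
ranked = combine_scores(docs)
```

With the default weight of 0.7, the LLM judgment dominates and page 8 moves to the top; lowering llm_weight shifts the ranking back toward the embedding similarity.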

7. Question Processor

7.1 Question Processing Flow

# questions_processing.py:136-180
def get_answer_for_question(self, document_name: str, question: str, schema: str) -> dict:
    """Process a single question and generate an answer."""
    
    # Select the retriever
    if self.llm_reranking:
        retriever = HybridRetriever(
            vector_db_dir=self.vector_db_dir,
            documents_dir=self.documents_dir
        )
    else:
        retriever = VectorRetriever(
            vector_db_dir=self.vector_db_dir,
            documents_dir=self.documents_dir
        )
    
    # Run retrieval
    if self.full_context:
        retrieval_results = retriever.retrieve_all(document_name)
    else:
        retrieval_results = retriever.retrieve(question, document_name, self.top_n_retrieval)
    
    # Format the retrieval results
    context = self._format_retrieval_results(retrieval_results, picture_answer)
    
    # Generate the answer
    answer = self.openai_processor.generate_answer(
        question=question,
        context=context,
        schema=schema,
        model=self.answering_model
    )
    
    return answer

7.2 Parallel Question Processing

# questions_processing.py:200-250
def process_all_questions(self, output_path: Path, submission_file: bool = True,
                         team_email: str = "", submission_name: str = "",
                         pipeline_details: str = ""):
    """Process all questions in parallel."""
    
    # Preprocess the questions
    processed_questions = self._preprocess_questions()
    
    # Process the questions in parallel with a thread pool
    with concurrent.futures.ThreadPoolExecutor(max_workers=self.parallel_requests) as executor:
        futures = []
        for question_data in processed_questions:
            future = executor.submit(
                self._process_single_question,
                question_data
            )
            futures.append(future)
        
        # Collect the results
        for future in tqdm(concurrent.futures.as_completed(futures), 
                          total=len(futures), desc="Processing questions"):
            try:
                result = future.result()
                with self._lock:
                    self.answer_details.append(result)
                    self.detail_counter += 1
            except Exception as e:
                print(f"Error processing question: {e}")
    
    # Save the results
    self._save_results(output_path, submission_file, team_email, submission_name, pipeline_details)
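
The submit-then-collect pattern above can be reduced to a standalone sketch. The version below is an illustration under stated assumptions: `process_in_parallel` and `fake_answer` are hypothetical names, and failures are recorded next to the answers instead of only being printed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def process_in_parallel(questions: List[str], answer_fn: Callable[[str], str],
                        max_workers: int = 4) -> Dict[str, List]:
    """Run answer_fn over all questions; collect answers and failures separately."""
    answers, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its question so failures can be attributed.
        futures = {executor.submit(answer_fn, q): q for q in questions}
        for future in as_completed(futures):
            question = futures[future]
            try:
                answers.append({"question": question, "answer": future.result()})
            except Exception as exc:
                errors.append({"question": question, "error": str(exc)})
    return {"answers": answers, "errors": errors}


def fake_answer(question: str) -> str:
    """Hypothetical stand-in for the retrieval and LLM call."""
    if not question:
        raise ValueError("empty question")
    return question.upper()


outcome = process_in_parallel(["q1", "", "q2"], fake_answer)
```

Results arrive in completion order, not submission order, so any consumer that needs the original order should sort by a question index afterwards. In this sketch the lists are only appended to from the thread that iterates `as_completed`, so no lock is needed.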