Natural Language Processing: Chapter 75, Using LLMs to Extract Structured Knowledge from Unstructured PDFs

Article link: https://blog.csdn.net/victor_manches/article/details/144218837

My project repository: Victor94-king/NLP__ManVictor: CSDN of ManVictor

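A note up front: keeping this series updated takes effort, so if you are passing by, a follow and a like would be much appreciated!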

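In today's data-driven world, organizations sit on countless PDF documents that hold a wealth of information. People can read these files easily, but machines that try to understand and use their contents face a major challenge. Whether they are research papers, technical manuals, or business reports, PDFs often contain valuable knowledge that could power intelligent systems and support data-driven decisions. Turning this unstructured PDF data into structured knowledge that machines can process efficiently has become one of the core challenges for modern information processing systems.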

I. The Challenges of Unstructured PDF Data

Rich as the information in PDF documents is, working with unstructured data raises three core challenges:

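Lack of explainability: it is hard to trace how the system arrived at a particular answer.
Limited analytical capability: unstructured data restricts the complex analyses that can be performed.
Reduced accuracy: the loss of accuracy is especially noticeable when processing large volumes of information.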

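These limitations highlight the strength of structured formats such as tables and knowledge graphs (see "An In-Depth Analysis of GraphRAG Principles: Knowledge Graph Construction"). They turn raw information into organized, queryable data that machines can process more effectively. To narrow the gap between unstructured documents and structured data, and so enable advanced analytics and AI applications, we need new approaches. This is both a core challenge for modern information processing systems and a key goal in building comprehensive knowledge graphs.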
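To make "organized, queryable" concrete, here is a minimal sketch of a knowledge graph stored as subject-relation-object triples. The `Triple` type and the `query` helper are hypothetical names introduced for illustration, and the sample facts are taken only from statements made in this article; none of this is part of the original pipeline.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One edge of a knowledge graph: (subject) -[relation]-> (obj)."""
    subject: str
    relation: str
    obj: str


def query(triples: List[Triple],
          subject: Optional[str] = None,
          relation: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return every triple that matches all fields that are not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (relation is None or t.relation == relation)
        and (obj is None or t.obj == obj)
    ]


# Illustrative facts (assumed encoding of sentences from this article)
facts = [
    Triple("PyMuPDF4LLM", "extends", "PyMuPDF"),
    Triple("PyMuPDF4LLM", "outputs", "Markdown"),
    Triple("PyTesseract", "wraps", "Tesseract-OCR"),
]

# Everything the graph knows about PyMuPDF4LLM
pymupdf4llm_facts = query(facts, subject="PyMuPDF4LLM")
```

The same question asked of raw text would need a search plus interpretation; against triples it is a simple filter, and each returned triple shows exactly which fact produced the answer.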

II. PDF Parsing and Structured Knowledge Extraction

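Turning PDF content into a knowledge graph is more than plain text extraction: it requires understanding context, identifying key concepts, and recognizing the relationships between ideas. The first step is to parse the PDF content.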

1. Choosing a PDF Parsing Tool

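Among the many parsing libraries available, this article uses PyMuPDF and its extension PyMuPDF4LLM. PyMuPDF4LLM's Markdown extraction preserves key structural elements such as headings and lists. This helps large language models (LLMs) recognize and interpret document structure, which in turn improves the results of retrieval-augmented generation (RAG).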
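One way to see why preserved headings matter: Markdown headings give a downstream step natural split points. The sketch below groups Markdown lines into sections by `#` headings. `split_by_heading` is a hypothetical helper written for illustration, not a PyMuPDF4LLM function, and it ignores edge cases such as `#` characters inside code blocks.

```python
import re
from typing import Dict, List

# An ATX heading: 1 to 6 '#' characters, whitespace, then the title
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown text into sections, one per heading."""
    sections: List[Dict[str, str]] = []
    heading = ""
    body: List[str] = []

    def flush() -> None:
        # Keep a section only if it has a heading or some non-blank text
        text = "\n".join(body).strip()
        if heading or text:
            sections.append({"heading": heading, "body": text})

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading = match.group(2)
            body = []
        else:
            body.append(line)
    flush()
    return sections


sample = "# Intro\nSome text.\n\n## Methods\n- step one\n- step two\n"
sections = split_by_heading(sample)
```

Each section keeps its heading alongside its body, so a retrieval step can pass both to the LLM instead of an anonymous slice of text.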

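When parsing a PDF, I use the Markdown generated by PyMuPDF4LLM as the text content of each page, extract all images, and append their OCR output to the same Markdown output. This approach automatically handles the different kinds of PDF: scanned PDFs that contain only images, PDFs that contain only text, and PDFs that contain both images and text.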
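The per-page combination step described above can be sketched as follows. The helpers `merge_page_text` and `classify_page`, and the labels they return, are assumptions introduced for illustration; the original code may combine the two sources differently.

```python
def merge_page_text(markdown_text: str, ocr_text: str) -> str:
    """Append the OCR text of a page's images to the page's Markdown text.

    Empty or whitespace-only parts are skipped, so the same call works
    whether the page yielded Markdown, OCR text, both, or neither.
    """
    parts = [p.strip() for p in (markdown_text, ocr_text) if p and p.strip()]
    return "\n\n".join(parts)


def classify_page(markdown_text: str, ocr_text: str) -> str:
    """Label a page by which source produced text (hypothetical labels)."""
    has_text = bool(markdown_text.strip())
    has_ocr = bool(ocr_text.strip())
    if has_text and has_ocr:
        return "text+images"
    if has_ocr:
        return "scanned"
    if has_text:
        return "text-only"
    return "empty"


# A scanned page: no embedded text, everything comes from OCR
scanned = merge_page_text("", "Invoice 2024\n")
# A mixed page: embedded text first, OCR text appended after it
mixed = merge_page_text("# Title\nBody", "Figure caption")
```

Because the merge simply skips whichever source is empty, no special-casing per PDF type is needed, which is the property the paragraph above relies on.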

2. OCR Technology

To extract text from images, we use PyTesseract, a Python wrapper for Google's Tesseract-OCR engine.

import base64
from typing import Union, List, Dict, Any

import pymupdf
import pymupdf4llm
import numpy as np
from PIL import Image, ImageFilter
from langchain_core.language_models import BaseLanguageModel
from langchain_core.messages import SystemMessage, HumanMessage


def ocr_images_pytesseract(images: List[Union[np.ndarray, Image.Image]]) -> str:
    """Run Tesseract OCR on each image and return the combined text."""
    # Imported lazily so pytesseract is only required when OCR is used
    import pytesseract

    all_text: str = ""
    for image in images:
        # Lightly smooth PIL images to reduce noise before OCR
        if isinstance(image, Image.Image):
            image = image.filter(ImageFilter.SMOOTH)
        # --psm 3: fully automatic page segmentation; --oem 1: LSTM engine only
        all_text += '\n' + pytesseract.image_to_string(image, lang="eng", config='--psm 3 --dpi 300 --oem 1')
    return all_text.strip('\n \t')


# Use PyMuPDF to open the document and PyMuPDF4LLM to get the markdown output
doc = pymupdf.Document(pdf_path)
doc_markdown = pymupdf4llm.to_markdown(doc, page_chunks=True, write_images=False, force_text=True)




def extract_page_info(page_num: int, doc_metadata: dict, toc: list = None) -> Dict[str, Any]:
    """Extracts text and metadata from a single page of a PDF document.


    Args:
        page_num (int): The page number
        doc_metadata (dict): The whole document metadata to store along with each page metadata
        toc (list | None): The list that represents the table-of-contents of the document

    Returns:
        A dictionary with the following keys:
            1) text: Text content extracted from the page
            2) page_metadata: Metadata specific to the page only like page number, chapter number etc.
            3) doc_metadata: Metadata common to the whole document like filename, author etc.

    """
    page_info = {}


    # Read the requested page from the document
    page = doc[page_num]


    # doc_markdown stores the page-by-page markdown output of the document
    text_content: str = doc_markdown[page_num]['text']


    # Get a list of all t