LLM - Building a Document Parsing Service API with the MinerU Model

Follow my CSDN blog: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/150273219

Disclaimer: This article is based on personal knowledge and public materials and is intended for academic exchange only. Discussion is welcome; reposting is not permitted.


MinerU: An Open-Source Solution for Precise Document Content Extraction

Source: Shanghai Artificial Intelligence Laboratory, 2024-09-27

GitHub: https://github.com/opendatalab/MinerU

MinerU

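MinerU's processing workflow consists of four stages:

  • Document Preprocessing: PyMuPDF reads the PDF files, filters out files that cannot be processed, and extracts PDF metadata, including whether the document is parseable, its language, and its page dimensions.

  • Content Parsing: PDF-Extract-Kit, a high-quality PDF extraction library, parses the key document content. Parsing starts with layout analysis, which covers layout detection and formula detection. A different recognizer is then applied to each type of region: OCR for text and titles, formula recognition for formulas, and table recognition for tables.

  • Content Post-Processing: Based on the output of the second stage, this stage removes invalid regions and stitches content together according to region position information, yielding the position, content, and reading order of each document region.

  • Format Conversion: Based on the post-processing results, the content can be converted into the formats users need, such as Markdown, for downstream use.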
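Stages 2 to 4 above can be pictured with a toy sketch. This is not MinerU's code: `Region`, `pick_recognizer`, `post_process`, and `to_markdown` are hypothetical names, and sorting by vertical position is only a stand-in for the real reading-order model.

```python
from dataclasses import dataclass

# Region kind -> recognizer, following the stage-2 description above.
RECOGNIZER_BY_KIND = {
    "text": "ocr",
    "title": "ocr",
    "formula": "formula_recognition",
    "table": "table_recognition",
}


@dataclass
class Region:
    kind: str     # class assigned by layout/formula detection
    top: float    # vertical position, used here as a simplified reading order
    content: str  # recognized content


def pick_recognizer(region):
    """Return the recognizer name for a region, or None for an invalid region."""
    return RECOGNIZER_BY_KIND.get(region.kind)


def post_process(regions):
    """Drop regions without a recognizer and order the rest top to bottom."""
    valid = [r for r in regions if pick_recognizer(r) is not None]
    return sorted(valid, key=lambda r: r.top)


def to_markdown(regions):
    """Render ordered regions as Markdown; titles become level-1 headings."""
    parts = [f"# {r.content}" if r.kind == "title" else r.content for r in regions]
    return "\n\n".join(parts)
```

For example, a page with a title, a body paragraph, and an unrecognized header region yields a two-block Markdown document with the header dropped.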

1. Configuring Docker

Build the Docker image:

wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/china/Dockerfile
docker build -t mineru-sglang-api:base -f Dockerfile .

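Recommendation: use the China (china) version of the Dockerfile; its main difference is that it adds a domestic pip mirror.

By default, the Dockerfile installs the [core] variant of MinerU.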

# Use the official sglang image
FROM lmsysorg/sglang:v0.4.7-cu124

# install mineru latest
RUN python3 -m pip install -U 'mineru[core]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages

# Download models and update the configuration file
RUN /bin/bash -c "mineru-models-download -s modelscope -m all"

# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "export MINERU_MODEL_SOURCE=local && exec \"$@\"", "--"]

The model dependencies total about 4.3 GB:

  • MinerU2.0: dated 2025-05, model size 0.9B
  • PDF-Extract-Kit: the PDF extraction toolkit
  • YOLO document layout model: doclayout_yolo_docstructbench_imgsz1280_2501.pt
  • Mathematical Formula Detection: yolo_v8_ft.pt
  • Mathematical Formula Recognition: unimernet_hf_small_2503
  • OCR: paddleocr_torch
  • Reading order (ReadingOrder): layout_reader
  • Table Recognition/Reconstruction: SlanetPlus, an enhanced version of SLANet (Structure Location Alignment Network), a table structure recognition model.

The downloaded files are laid out as follows:

# /root/.cache/modelscope/hub/models/OpenDataLab/
|-- MinerU2.0-2505-0.9B -> /root/.cache/modelscope/hub/models/OpenDataLab/MinerU2___0-2505-0___9B
|-- MinerU2___0-2505-0___9B
|-- PDF-Extract-Kit-1.0 -> /root/.cache/modelscope/hub/models/OpenDataLab/PDF-Extract-Kit-1___0
`-- PDF-Extract-Kit-1___0
    `-- models
        |-- Layout
        |   `-- YOLO
        |       `-- doclayout_yolo_docstructbench_imgsz1280_2501.pt
        |-- MFD
        |   `-- YOLO
        |       `-- yolo_v8_ft.pt
        |-- MFR
        |   `-- unimernet_hf_small_2503
        |-- OCR
        |   `-- paddleocr_torch
        |-- ReadingOrder
        |   `-- layout_reader
        `-- TabRec
            `-- SlanetPlus
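Before starting the service, it can help to confirm that every model in the tree above is actually present. A minimal sketch, assuming the directory layout shown above; `EXPECTED_MODEL_PATHS`, `DEFAULT_MODELS_DIR`, and `missing_models` are hypothetical names, not part of MinerU.

```python
from pathlib import Path

# Default location taken from the tree above (assumes the ModelScope cache under /root).
DEFAULT_MODELS_DIR = "/root/.cache/modelscope/hub/models/OpenDataLab/PDF-Extract-Kit-1.0/models"

# Relative paths copied from the directory listing above.
EXPECTED_MODEL_PATHS = [
    "Layout/YOLO/doclayout_yolo_docstructbench_imgsz1280_2501.pt",
    "MFD/YOLO/yolo_v8_ft.pt",
    "MFR/unimernet_hf_small_2503",
    "OCR/paddleocr_torch",
    "ReadingOrder/layout_reader",
    "TabRec/SlanetPlus",
]


def missing_models(models_dir=DEFAULT_MODELS_DIR):
    """Return the expected model paths that do not exist under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED_MODEL_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_models()
    print("All models present" if not missing else f"Missing: {missing}")
```

An empty list means the download step finished; any entry in the list points at a model that should be downloaded again.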

Once the build completes, the Docker image is listed:

REPOSITORY          TAG       IMAGE ID       CREATED        SIZE
mineru-sglang-api   latest    dbf67f121053   2 hours ago    24.8GB

Start the container:

docker run -itd \
--name mineru-sglang-api \
--gpus all \
--shm-size=128g \
--memory=256g \
--cpus=64 \
--ipc=host \
-v xxx/:xxx/ \
--privileged \
--network host \
mineru-sglang-api:base \
/bin/bash

# docker exec -it mineru-sglang-api /bin/bash

docker attach mineru-sglang-api

By default, the base Docker image contains only the sglang project:

/sgl-workspace# tree -L 2
.
`-- sglang
    |-- 3rdparty
    |-- LICENSE
    |-- Makefile
    |-- README.md
    |-- assets
    |-- benchmark
    |-- docker
    |-- docs
    |-- examples
    |-- python
    |-- scripts
    |-- sgl-kernel
    |-- sgl-pdlb
    |-- sgl-router
    `-- test

Start the service, making sure the port is not already in use:

mineru-sglang-server --host 0.0.0.0 --port 9001
# CUDA_VISIBLE_DEVICES="3" nohup mineru-sglang-server --host 0.0.0.0 --port 9001 > nohup.out &

# Kill all mineru-sglang-server processes
ps -ef | grep "mineru-sglang-server" | grep -v grep | awk '{print $2}' | xargs kill -9

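On an 80 GB GPU, the service occupies about 67 GB of memory (67032 MiB of 81920 MiB, roughly 82%), which is used to run multiple instances: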

|   3  NVIDIA A800-SXM4-80GB          Off |   00000000:4B:00.0 Off |                    0 |
| N/A   30C    P0             59W /  400W |   67032MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |

A 24 GB RTX 4090 also works, with fewer instances.
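The occupancy figure can be checked directly from the nvidia-smi readings above. The sketch below is plain arithmetic; `memory_usage_percent` is a hypothetical helper name.

```python
def memory_usage_percent(used_mib: int, total_mib: int) -> float:
    """Return GPU memory usage as a percentage of total memory."""
    if total_mib <= 0:
        raise ValueError("total_mib must be positive")
    return 100.0 * used_mib / total_mib


# Readings taken from the nvidia-smi output above.
print(f"{memory_usage_percent(67032, 81920):.1f}%")  # 81.8%
```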

2. Testing the Service

Set up the MinerU environment:

eval "$(micromamba shell hook --shell zsh)"

micromamba create -n mineru python=3.12
micromamba activate mineru

# Must be installed beforehand
micromamba install onnxruntime
micromamba install numpy==2.3.1

pip install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple
pip install uv -i https://mirrors.aliyun.com/pypi/simple
uv pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple

Test data:

mydata/test_contract.pdf
https://xxx/test_contract.pdf

Test the service:

# Local test
mineru -p "mydata/test_contract.pdf" -o "test_pdf_output" -b vlm-sglang-client -u http://0.0.0.0:9001

# Remote test
mineru -p "mydata/test_contract.pdf" -o "test_output" -b vlm-sglang-client -u http://xxx.xxx

# mineru does not support URL input yet; it needs to be adapted
# mineru -p "https://xxx/test_contract.pdf" -o test_url_output -b vlm-sglang-client -u http://0.0.0.0:9001

mineru -p "https://xxx/test_contract.pdf" -o test_url_output -b vlm-sglang-client -u "http://xxx.xxx"

Run time:

# PDF run time
WARNING:sglang.srt.models.registry:Ignore import error when loading sglang.srt.models.torch_native_llama. tensor model parallel group is not initialized
2025-07-02 06:24:30.038 | INFO     | mineru.backend.vlm.predictor:get_predictor:110 - get_predictor cost: 1.09s
2025-07-02 06:24:46.159 | INFO     | mineru.cli.common:do_parse:224 - local output dir is test_output/test_contract/vlm
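The two timestamps in the log give a rough parse time for this PDF. A minimal sketch, assuming the `YYYY-MM-DD HH:MM:SS.mmm | LEVEL | ...` prefix shown in the log lines above; `elapsed_seconds` is a hypothetical helper name.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Return the seconds between the timestamps of two log lines."""

    def parse(line: str) -> datetime:
        # The timestamp is the text before the first " | " separator.
        return datetime.strptime(line.split(" | ", 1)[0].strip(), LOG_TIME_FORMAT)

    return (parse(end_line) - parse(start_line)).total_seconds()


start = "2025-07-02 06:24:30.038 | INFO     | mineru.backend.vlm.predictor:get_predictor:110 - get_predictor cost: 1.09s"
end = "2025-07-02 06:24:46.159 | INFO     | mineru.cli.common:do_parse:224 - local output dir is test_output/test_contract/vlm"
print(f"{elapsed_seconds(start, end):.3f} s")  # 16.121 s
```

For this run, about 16.1 seconds pass between the predictor being ready and the output directory being written.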

Images cannot be processed for now.

Bugfix: running the service on images raises an ImportError:

ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Solution, following the reference "ImportError: libGL.so.1: cannot open shared object file: No such file or directory":

RUN apt-get update && apt-get install ffmpeg libsm6 libxext6  -y

For local testing, see demo/demo.py in the MinerU repository:

# Local mode, which supports images
parse_doc(doc_path_list, output_dir, backend="pipeline")

# Remote mode
parse_doc(doc_path_list, output_dir, backend="vlm-sglang-client", server_url="http://xxx.xxx")
parse_doc(doc_path_list, output_dir, backend="vlm-sglang-client", server_url="http://0.0.0.0:9001")

3. Optimizing MinerU

Build the service from a custom version of MinerU:

# Configure the SSH key
vim ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa

# Clone the project
git clone https://github.com/opendatalab/MinerU.git

# Uninstall the current version
pip show mineru
pip uninstall mineru --break-system-packages

# Switch branches
cd MinerU
git branch -a
git checkout version

# Update the installation
# The next line installs the released package from PyPI; to install the checked-out source instead, use one of the editable installs below.
python3 -m pip install -U 'mineru[core]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages
# pip install --upgrade -e '.[core]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages
# pip install --upgrade -e '.[all]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages

# Add the environment variable (append the export line to ~/.bashrc)
vim ~/.bashrc
export MINERU_MODEL_SOURCE="modelscope"

# Test
bash /sgl-workspace/start.sh

The MINERU_MODEL_SOURCE environment variable is critical for starting the service.

Add the following startup scripts:

check_health.sh

#!/bin/bash

url="http://127.0.0.1:9001/health"
http_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 --location "$url")
if [ $? -eq 0 ] && [ "$http_code" = "200" ]; then
    echo 0
else
    echo -1
fi
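The same probe can be run from Python, which is convenient when a client should wait for the model to finish loading before sending requests. A minimal sketch, assuming the /health endpoint and port 9001 used in check_health.sh above; `check_health` and `wait_until_healthy` are hypothetical helper names, not part of MinerU.

```python
import time
import urllib.error
import urllib.request


def check_health(url: str = "http://127.0.0.1:9001/health", timeout: float = 5.0) -> int:
    """Return 0 if the endpoint answers with HTTP 200, otherwise -1 (same convention as check_health.sh)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if response.status == 200 else -1
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return -1


def wait_until_healthy(check=check_health, retries: int = 30, interval: float = 2.0) -> bool:
    """Poll the health check until it succeeds or the retries are exhausted."""
    for _ in range(retries):
        if check() == 0:
            return True
        time.sleep(interval)
    return False
```

The retry count and interval are arbitrary defaults; they should be sized to how long the server takes to load the model on the target GPU.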

start.sh. Note: start the service through start.sh rather than invoking mineru-sglang-server directly.

#!/bin/bash

# Script name
SCRIPT_NAME=$(basename "$0")

# Logging function
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $SCRIPT_NAME: $1"
}

# Check current working directory
check_working_directory() {
    log "Current working directory: $(pwd)"
    
    # Check write permissions
    if [ ! -w . ]; then
        log "WARNING: Current user doesn't have write permissions in this directory"
    fi
    
    # Check available disk space
    local available_space=$(df -h . | awk 'NR==2 {print $4}')
    log "Available space in current directory: $available_space"
}

# Check environment variables
check_environment_variables() {
    log "Checking environment variables..."
    
    # Required environment variables
    local required_vars=("HOME" "PATH" "USER")
    
    for var in "${required_vars[@]}"; do
        if [ -z "${!var}" ]; then
            log "ERROR: Environment variable $var is not set"
        else
            log "$var=${!var}"
        fi
    done
    
    # Check if mineru-sglang-server is in PATH
    if ! command -v mineru-sglang-server >/dev/null 2>&1; then
        log "ERROR: mineru-sglang-server not found in PATH"
        log "Current PATH: $PATH"
    fi
    
    # Check other potentially important environment variables
    local optional_vars=("LD_LIBRARY_PATH" "CUDA_HOME" "PYTHONPATH" "MINERU_MODEL_SOURCE")
    
    for var in "${optional_vars[@]}"; do
        if [ -n "${!var}" ]; then
            log "$var=${!var}"
        fi
    done
}

# Main function
main() {
    log "Starting environment checks"
    check_working_directory
    check_environment_variables
    log "Environment checks completed"
    # Execute the original command
    log "Attempting to execute: mineru-sglang-server --host 0.0.0.0 --port 9001"
    exec mineru-sglang-server --host 0.0.0.0 --port 9001
}

# Execute main function
main

hang_up.sh

#!/bin/bash

while true
do
    sleep 600
    echo "[$(date)] sleep..."
done

Test the service:

# Start the service
cd /sgl-workspace
nohup bash start.sh > nohup.out &

# Test whether a PDF URL is accepted as input
mineru -p "https://xxx/test_contract.pdf" -o test_url_output -b vlm-sglang-client -u http://0.0.0.0:9001

# Stop the service
ps -ef | grep "mineru-sglang-server" | grep -v grep | awk '{print $2}' | xargs kill -9

Compare the local and the remote PDF to confirm that both yield the same bytes:

import requests
from pathlib import Path
from urllib.parse import urlparse


def read_fn(path):
    """Read a file and return its raw bytes."""
    if not isinstance(path, Path):
        path = Path(path)
    with open(str(path), "rb") as input_file:
        file_bytes = input_file.read()
    return file_bytes


def process_local_pdf(pdf_path):
    """
    Load a local PDF file and return its bytes.
    """
    file_name = str(Path(pdf_path).stem)
    print(f"Processing local PDF: {file_name}")
    pdf_bytes = read_fn(pdf_path)
    return pdf_bytes


def process_remote_pdf(pdf_url, timeout=30):
    """
    Download a remote PDF file and return its bytes.
    """
    # Take the file name from the URL path so that a query string does not leak into it
    file_name = Path(urlparse(pdf_url).path).stem
    print(f"Processing remote PDF: {file_name}")
    # Download the remote PDF file
    response = requests.get(pdf_url, timeout=timeout)
    response.raise_for_status()

    # Return the PDF bytes, consistent with process_local_pdf
    return response.content


def main():
    path1 = "mydata/test_contract.pdf"
    path2 = "https://xxx/test_contract.pdf"
    content1 = process_local_pdf(path1)
    content2 = process_remote_pdf(path2)
    print(f"Local PDF: {len(content1)}")
    print(f"Remote PDF: {len(content2)}")
    assert content1 == content2


if __name__ == "__main__":
    main()
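Since a downloaded PDF carries the same bytes as the local file, one possible way to adapt URL input is to download the file first and hand the resulting local path to the parser. A minimal sketch using only the standard library; `is_url` and `to_local_pdf` are hypothetical helpers, not MinerU functions, and this is not necessarily how the customized branch implements it.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def is_url(path_or_url: str) -> bool:
    """Return True for http(s) URLs; anything else is treated as a local path."""
    return urlparse(str(path_or_url)).scheme in ("http", "https")


def to_local_pdf(path_or_url: str, download_dir: Optional[str] = None, timeout: int = 30) -> Path:
    """Return a local path for the input, downloading it first if it is a URL."""
    if not is_url(path_or_url):
        return Path(path_or_url)
    # Derive the file name from the URL path, ignoring any query string.
    name = Path(urlparse(path_or_url).path).name or "download.pdf"
    target_dir = Path(download_dir or tempfile.mkdtemp(prefix="mineru_url_"))
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    with urllib.request.urlopen(path_or_url, timeout=timeout) as response:
        target.write_bytes(response.read())
    return target
```

The returned path can then be passed wherever a local file is expected, for example as the `-p` argument of the mineru command.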

4. Building the Project

Remote container image:

docker ps -a | grep "mineru"

docker login micr.cloud.mioffice.cn -u xxx -p xxx
docker commit c34ed52eca37 mineru-sglang-api
docker tag mineru-sglang-api:latest mineru-sglang-api:v1.1
docker push mineru-sglang-api:v1.1
docker inspect mineru-sglang-api 
docker images

# Stop and remove
docker ps -a | grep "mineru"
docker stop xxx
docker rm xxx
docker images | grep "mineru"
docker rmi xxx

Verify the Docker image:

docker run -itd \
--name mineru-sglang-api \
--gpus all \
--shm-size=128g \
--memory=256g \
--cpus=64 \
--ipc=host \
-v xxx/:xxx/ \
--privileged \
--network host \
mineru-sglang-api \
/bin/bash

docker attach mineru-sglang-api

Configure the remote service, testing the configuration first:

# Container image
mineru-sglang-api:v1.1

bash /sgl-workspace/start.sh          # Startup command
bash /sgl-workspace/check_health.sh   # Health check

# Environment variable
export MINERU_MODEL_SOURCE="modelscope"