Follow me on CSDN: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/150273219
Disclaimer: This article is based on personal knowledge and public materials and is intended for academic exchange only. Discussion is welcome; reposting is not permitted.
MinerU: An Open-Source Solution for Precise Document Content Extraction
Source: Shanghai AI Laboratory, 2024-09-27
GitHub: https://github.com/opendatalab/MinerU
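MinerU's processing workflow has four stages:
- Document Preprocessing: PyMuPDF reads the PDF files, filters out files that cannot be processed, and extracts PDF metadata, including whether the document is parseable, its language, and its page dimensions.
- Content Parsing: PDF-Extract-Kit, a high-quality PDF extraction library, parses the key document content. Parsing starts with layout analysis, which covers layout detection and formula detection. A different recognizer is then applied to each region type: OCR for text and titles, formula recognition for formulas, and table recognition for tables.
- Content Post-Processing: Based on the output of the second stage, this stage removes invalid regions and stitches content together according to region positions, yielding the location, content, and reading order of each document region.
- Format Conversion: Based on the post-processing results, the content can be exported in the formats users need, such as Markdown, for downstream use.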
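The routing in stages two and three can be pictured with a minimal sketch. Everything here is hypothetical and not MinerU's actual code: the `Region` class, the `RECOGNIZERS` table, and the naive top-to-bottom sort (MinerU uses a dedicated reading-order model instead) are assumptions made only to illustrate "one recognizer per region type, drop invalid regions, then order the rest".

```python
from dataclasses import dataclass


@dataclass
class Region:
    kind: str    # e.g. "text", "title", "formula", "table"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


# Hypothetical mapping from region type to recognizer name.
RECOGNIZERS = {
    "text": "ocr",
    "title": "ocr",
    "formula": "formula_recognition",
    "table": "table_recognition",
}


def route_regions(regions):
    """Pick a recognizer per region, drop unsupported regions, sort by position."""
    routed = []
    for region in regions:
        recognizer = RECOGNIZERS.get(region.kind)
        if recognizer is None:
            # Stand-in for post-processing: discard regions nothing can handle.
            continue
        routed.append((region, recognizer))
    # Naive reading order (assumption): top-to-bottom, then left-to-right.
    routed.sort(key=lambda item: (item[0].bbox[1], item[0].bbox[0]))
    return routed
```

Calling `route_regions` on a table placed below a title returns the title first with `"ocr"`, then the table with `"table_recognition"`.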
1. Configure Docker
Build the Docker image:
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/china/Dockerfile
docker build -t mineru-sglang-api:base -f Dockerfile .
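Recommendation: use the china variant of the Dockerfile, whose main difference is that it adds a domestic (China) pip mirror.
The Dockerfile installs the [core] variant of MinerU by default.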
# Use the official sglang image
FROM lmsysorg/sglang:v0.4.7-cu124
# install mineru latest
RUN python3 -m pip install -U 'mineru[core]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages
# Download models and update the configuration file
RUN /bin/bash -c "mineru-models-download -s modelscope -m all"
# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "export MINERU_MODEL_SOURCE=local && exec \"$@\"", "--"]
The dependent models total about 4.3 GB:
- MinerU2.0: released 2025-05, 0.9B parameters
- PDF-Extract-Kit: the PDF extraction toolkit, which contains:
  - YOLO document layout model: doclayout_yolo_docstructbench_imgsz1280_2501.pt
  - Mathematical Formula Detection: yolo_v8_ft.pt
  - Mathematical Formula Recognition: unimernet_hf_small_2503
  - OCR: paddleocr_torch
  - Reading order (ReadingOrder): layout_reader
  - Table Recognition/Reconstruction: SlanetPlus, i.e. SLANet_plus, an enhanced version of PaddleOCR's SLANet (Structure Location Alignment Network) table structure recognition model.
The downloaded models are laid out as follows:
# /root/.cache/modelscope/hub/models/OpenDataLab/
|-- MinerU2.0-2505-0.9B -> /root/.cache/modelscope/hub/models/OpenDataLab/MinerU2___0-2505-0___9B
|-- MinerU2___0-2505-0___9B
|-- PDF-Extract-Kit-1.0 -> /root/.cache/modelscope/hub/models/OpenDataLab/PDF-Extract-Kit-1___0
`-- PDF-Extract-Kit-1___0
    `-- models
        |-- Layout
        |   `-- YOLO
        |       `-- doclayout_yolo_docstructbench_imgsz1280_2501.pt
        |-- MFD
        |   `-- YOLO
        |       `-- yolo_v8_ft.pt
        |-- MFR
        |   `-- unimernet_hf_small_2503
        |-- OCR
        |   `-- paddleocr_torch
        |-- ReadingOrder
        |   `-- layout_reader
        `-- TabRec
            `-- SlanetPlus
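Before starting the service it can help to confirm that the download step left every model in place. The following is a minimal sketch, assuming the directory layout listed above; the helper name `missing_models` is hypothetical and not part of MinerU.

```python
from pathlib import Path

# Relative paths taken from the directory listing above.
EXPECTED_MODELS = [
    "Layout/YOLO/doclayout_yolo_docstructbench_imgsz1280_2501.pt",
    "MFD/YOLO/yolo_v8_ft.pt",
    "MFR/unimernet_hf_small_2503",
    "OCR/paddleocr_torch",
    "ReadingOrder/layout_reader",
    "TabRec/SlanetPlus",
]


def missing_models(models_dir):
    """Return the expected model paths that do not exist under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED_MODELS if not (root / rel).exists()]


if __name__ == "__main__":
    base = "/root/.cache/modelscope/hub/models/OpenDataLab/PDF-Extract-Kit-1.0/models"
    for rel in missing_models(base):
        print(f"missing: {rel}")
```

An empty result means all six entries were found; the check only tests for existence, not file integrity.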
The built Docker image:
REPOSITORY TAG IMAGE ID CREATED SIZE
mineru-sglang-api latest dbf67f121053 2 hours ago 24.8GB
Start the container:
docker run -itd \
--name mineru-sglang-api \
--gpus all \
--shm-size=128g \
--memory=256g \
--cpus=64 \
--ipc=host \
-v xxx/:xxx/ \
--privileged \
--network host \
mineru-sglang-api:v1.1 \
/bin/bash
# docker exec -it mineru-sglang-api /bin/bash
docker attach mineru-sglang-api
The base image contains only the sglang project by default:
/sgl-workspace# tree -L 2
.
`-- sglang
    |-- 3rdparty
    |-- LICENSE
    |-- Makefile
    |-- README.md
    |-- assets
    |-- benchmark
    |-- docker
    |-- docs
    |-- examples
    |-- python
    |-- scripts
    |-- sgl-kernel
    |-- sgl-pdlb
    |-- sgl-router
    `-- test
Start the service, taking care to choose a port that is not already in use:
mineru-sglang-server --host 0.0.0.0 --port 9001
# CUDA_VISIBLE_DEVICES="3" nohup mineru-sglang-server --host 0.0.0.0 --port 9001 > nohup.out &
# Kill all mineru-sglang-server processes
ps -ef | grep "mineru-sglang-server" | grep -v grep | awk '{print $2}' | xargs kill -9
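On an 80 GB GPU the server occupies about 67 GB of memory, roughly 80%, which is the figure to budget for when starting multiple instances: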
| 3 NVIDIA A800-SXM4-80GB Off | 00000000:4B:00.0 Off | 0 |
| N/A 30C P0 59W / 400W | 67032MiB / 81920MiB | 0% Default |
| | | Disabled |
A 24 GB RTX 4090 can also run the service, with fewer instances.
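The "roughly 80%" figure can be checked directly from the nvidia-smi row shown above. This is a small sketch; the function name `memory_usage` is illustrative only.

```python
import re


def memory_usage(line):
    """Return (used_mib, total_mib, fraction) parsed from an nvidia-smi row."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", line)
    if match is None:
        raise ValueError("no 'usedMiB / totalMiB' field found")
    used, total = int(match.group(1)), int(match.group(2))
    return used, total, used / total


row = "| N/A 30C P0 59W / 400W | 67032MiB / 81920MiB | 0% Default |"
used, total, fraction = memory_usage(row)
print(f"{used} MiB of {total} MiB used ({fraction:.1%})")
```

For the row above this prints a usage of 81.8%, consistent with the rounded figure in the text.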
2. Test the Service
Set up the MinerU client environment:
eval "$(micromamba shell hook --shell zsh)"
micromamba create -n mineru python=3.12
micromamba activate mineru
# These must be installed first
micromamba install onnxruntime
micromamba install numpy==2.3.1
pip install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple
pip install uv -i https://mirrors.aliyun.com/pypi/simple
uv pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple
Test data:
mydata/test_contract.pdf
https://xxx/test_contract.pdf
Test the service:
# Local test
mineru -p "data/test_contract.pdf" -o "test_pdf_output" -b vlm-sglang-client -u http://0.0.0.0:9001
# Remote test
mineru -p "data/test_contract.pdf" -o "test_output" -b vlm-sglang-client -u http://xxx.xxx
# mineru does not support URL input yet; it needs to be adapted
# mineru -p "https://xxx/test_contract.pdf" -o test_url_output -b vlm-sglang-client -u http://0.0.0.0:9001
mineru -p "https://xxx/test_contract.pdf" -o test_url_output -b vlm-sglang-client -u "http://xxx.xxx"
Run time (about 16 s elapse between the two timestamped log lines):
# PDF run time
WARNING:sglang.srt.models.registry:Ignore import error when loading sglang.srt.models.torch_native_llama. tensor model parallel group is not initialized
2025-07-02 06:24:30.038 | INFO | mineru.backend.vlm.predictor:get_predictor:110 - get_predictor cost: 1.09s
2025-07-02 06:24:46.159 | INFO | mineru.cli.common:do_parse:224 - local output dir is test_output/test_contract/vlm
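The elapsed time can be read off the two timestamped log lines above. This is a minimal sketch that assumes the timestamp is always the first `|`-separated field, as in these lines; `log_time` is a hypothetical helper.

```python
from datetime import datetime


def log_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log line."""
    stamp = line.split("|", 1)[0].strip()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


start = "2025-07-02 06:24:30.038 | INFO | mineru.backend.vlm.predictor:get_predictor:110 - get_predictor cost: 1.09s"
end = "2025-07-02 06:24:46.159 | INFO | mineru.cli.common:do_parse:224 - local output dir is test_output/test_contract/vlm"

elapsed = (log_time(end) - log_time(start)).total_seconds()
print(f"{elapsed:.3f} s between predictor setup and output")
```

This measures only the gap between those two log entries, not the full wall-clock time of the command.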
Images cannot be processed yet.
Bugfix: running the image service raises an ImportError:
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
For the fix, see: ImportError: libGL.so.1: cannot open shared object file: No such file or directory
RUN apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
For local testing, see demo/demo.py in the MinerU repository:
# Local mode, which supports images
parse_doc(doc_path_list, output_dir, backend="pipeline")
# Remote mode
parse_doc(doc_path_list, output_dir, backend="vlm-sglang-client", server_url="http://xxx.xxx")
parse_doc(doc_path_list, output_dir, backend="vlm-sglang-client", server_url="http://0.0.0.0:9001")
3. Optimize MinerU
Build the service from a custom version of MinerU:
# Configure SSH
vim ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
# Clone the project
git clone https://github.com/opendatalab/MinerU.git
# Uninstall the current version
pip show mineru
pip uninstall mineru --break-system-packages
# Switch branches (git clone creates the directory MinerU)
cd MinerU
git branch -a
git checkout version
# Update the installed version
python3 -m pip install -U 'mineru[core]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages
# Alternatives: editable installs from the checked-out source
# pip install --upgrade -e '.[core]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages
# pip install --upgrade -e '.[all]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages
# Add the environment variable (append the export line to ~/.bashrc)
vim ~/.bashrc
export MINERU_MODEL_SOURCE="modelscope"
# Test
bash /sgl-workspace/start.sh
The MINERU_MODEL_SOURCE environment variable is critical for starting the service.
Add the following startup scripts:
check_health.sh
#!/bin/bash
url="http://127.0.0.1:9001/health"
http_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 --location "$url")
if [ $? -eq 0 ] && [ "$http_code" = "200" ]; then
    echo 0
else
    echo -1
fi
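The same check can be written in Python, which is convenient when a client needs to wait for the server to finish loading. This is a sketch using only the standard library; it assumes the `/health` endpoint and port from the script above, and `wait_until_healthy` with its `attempts` and `delay` parameters is a hypothetical helper.

```python
import http.client
import time
import urllib.error
import urllib.request


def check_health(url="http://127.0.0.1:9001/health", timeout=5.0):
    """Return 0 if the endpoint answers HTTP 200, otherwise -1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if response.status == 200 else -1
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return -1


def wait_until_healthy(url, attempts=30, delay=2.0, timeout=5.0):
    """Poll the endpoint until it is healthy or the attempts run out."""
    for _ in range(attempts):
        if check_health(url, timeout=timeout) == 0:
            return True
        time.sleep(delay)
    return False
```

The return values 0 and -1 mirror the shell script, so either can back the same health probe.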
start.sh. Note: start the service through start.sh rather than invoking mineru-sglang-server directly.
#!/bin/bash
# Script name
SCRIPT_NAME=$(basename "$0")
# Logging function
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $SCRIPT_NAME: $1"
}
# Check current working directory
check_working_directory() {
    log "Current working directory: $(pwd)"
    # Check write permissions
    if [ ! -w . ]; then
        log "WARNING: Current user doesn't have write permissions in this directory"
    fi
    # Check available disk space
    local available_space=$(df -h . | awk 'NR==2 {print $4}')
    log "Available space in current directory: $available_space"
}
# Check environment variables
check_environment_variables() {
    log "Checking environment variables..."
    # Required environment variables
    local required_vars=("HOME" "PATH" "USER")
    for var in "${required_vars[@]}"; do
        if [ -z "${!var}" ]; then
            log "ERROR: Environment variable $var is not set"
        else
            log "$var=${!var}"
        fi
    done
    # Check if mineru-sglang-server is in PATH
    if ! command -v mineru-sglang-server >/dev/null 2>&1; then
        log "ERROR: mineru-sglang-server not found in PATH"
        log "Current PATH: $PATH"
    fi
    # Check other potentially important environment variables
    local optional_vars=("LD_LIBRARY_PATH" "CUDA_HOME" "PYTHONPATH" "MINERU_MODEL_SOURCE")
    for var in "${optional_vars[@]}"; do
        if [ -n "${!var}" ]; then
            log "$var=${!var}"
        fi
    done
}
# Main function
main() {
    log "Starting environment checks"
    check_working_directory
    check_environment_variables
    log "Environment checks completed"
    # Execute the original command
    log "Attempting to execute: mineru-sglang-server --host 0.0.0.0 --port 9001"
    exec mineru-sglang-server --host 0.0.0.0 --port 9001
}
# Execute main function
main
hang_up.sh
#!/bin/bash
while true
do
    sleep 600
    echo "[$(date)] sleep..."
done
Test the service:
# Start the service
cd /sgl-workspace
nohup bash start.sh > nohup.out &
# Test whether PDF URLs are supported
mineru -p "https://xxx/test_contract.pdf" -o test_url_output -b vlm-sglang-client -u http://0.0.0.0:9001
# Stop the service
ps -ef | grep "mineru-sglang-server" | grep -v grep | awk '{print $2}' | xargs kill -9
Comparing a local PDF with its remote copy:
import requests
from pathlib import Path


def read_fn(path):
    # Read a local file and return its raw bytes.
    if not isinstance(path, Path):
        path = Path(path)
    with open(str(path), "rb") as input_file:
        file_bytes = input_file.read()
    return file_bytes


def process_local_pdf(pdf_path):
    """
    Process a local PDF file.
    """
    file_name = str(Path(pdf_path).stem)
    print(f"Processing local PDF: {file_name}")
    pdf_bytes = read_fn(pdf_path)
    return pdf_bytes


def process_remote_pdf(pdf_url, timeout=30):
    """
    Process a remote PDF file.
    """
    # Download the remote PDF file
    file_name = str(Path(pdf_url).stem)
    print(f"Processing remote PDF: {file_name}")
    response = requests.get(pdf_url, timeout=timeout)
    response.raise_for_status()
    # Return the PDF bytes, consistent with process_local_pdf
    return response.content


def main():
    path1 = "mydata/test_contract.pdf"
    path2 = "https://xxx/test_contract.pdf"
    content1 = process_local_pdf(path1)
    content2 = process_remote_pdf(path2)
    print(f"Local PDF: {len(content1)}")
    print(f"Remote PDF: {len(content2)}")
    # Both sources should yield byte-identical content
    assert content1 == content2


if __name__ == "__main__":
    main()
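Since both functions end up returning bytes, the URL adaptation can be reduced to a single loader that accepts either kind of input. The sketch below is one possible shape, not MinerU's implementation: the names `is_url`, `source_stem`, and `read_pdf_bytes` are hypothetical, and the standard-library `urllib` is swapped in for `requests` to keep the example dependency-free.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def is_url(source):
    """True when source looks like an http(s) URL rather than a local path."""
    return urlparse(str(source)).scheme in ("http", "https")


def source_stem(source):
    """File name without extension, for a local path or a URL."""
    # For URLs, use only the path component so query strings are ignored.
    path = urlparse(str(source)).path if is_url(source) else str(source)
    return Path(path).stem


def read_pdf_bytes(source, timeout=30):
    """Return the PDF bytes from a local path or an http(s) URL."""
    if is_url(source):
        # urlopen raises HTTPError on 4xx/5xx responses.
        with urlopen(str(source), timeout=timeout) as response:
            return response.read()
    return Path(source).read_bytes()
```

With this helper, a caller could pass `mydata/test_contract.pdf` or the remote URL interchangeably and hand the resulting bytes to the same parsing code.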
4. Build the Project
Publish the container image to a remote registry:
docker ps -a | grep "mineru"
docker login micr.cloud.mioffice.cn -u xxx -p xxx
docker commit c34ed52eca37 mineru-sglang-api
docker tag mineru-sglang-api:latest mineru-sglang-api:v1.1
docker push mineru-sglang-api:v1.1
docker inspect mineru-sglang-api
docker images
# Stop and remove
docker ps -a | grep "mineru"
docker stop xxx
docker rm xxx
docker images | grep "mineru"
docker rmi xxx
Verify the Docker image:
docker run -itd \
--name mineru-sglang-api \
--gpus all \
--shm-size=128g \
--memory=256g \
--cpus=64 \
--ipc=host \
-v xxx/:xxx/ \
--privileged \
--network host \
mineru-sglang-api \
/bin/bash
docker attach mineru-sglang-api
Configure the remote service, starting with a test configuration:
# Container image
mineru-sglang-api:v1.1
bash /sgl-workspace/start.sh # Start command
bash /sgl-workspace/check_health.sh # Health check
# Environment variable
export MINERU_MODEL_SOURCE="modelscope"