Skip to content

opendatalab/MinerU-Popo

Repository files navigation

MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

arXiv Hugging Face dataset License MIT

If you like our project, please give us a star ⭐ on GitHub for the latest update.

📖 English  |  简体中文

image

✨ Introduction

MinerU-Popo is a lightweight and universal framework for POst-Processing OCR outputs, bridging the gap between page-level OCR parsing and document-level semantic structure. It constructs document tree structures with a 4B post-processing model that performs four subtasks: table truncation analysis, text truncation analysis, title hierarchy analysis, and image-text association analysis. We handle the challenges of cross-page geometric discontinuity, redundant document parsing, and scalability to long documents via:

  • Task-Oriented Data Engine: Generate representative training data and simplify the task-specific input.
  • Dynamic Chunking and Synchronization: Process long document by dynamic chunks and reduce deviations across chunks to preserve global consistency.
  • Document Enrichment: Structurally construct a tree, semantically generate summaries and split long-section nodes.

image

📊 Performance

Better Hierarchy (TEDS) after Post-Processing

Basic OCR Before After
MinerU 53.7 90.6
MonkeyOCR 48.9 87.4
Dolphin 60.4 83.5
PaddleOCR 59.3 82.6
GLM-OCR 53.5 81.8

Advantages Compared to Directly Using Pre-trained Model

Model TEDS Doc/s
MinerU-Popo 90.6 0.37
Qwen3-VL-2B 21.2 0.22
Qwen3-VL-4B 56.5 0.20
Qwen3-VL-8B 65.9 0.16
Qwen3-VL-32B 78.0 0.04

Benefits for Downstream Retrieval and Analysis (Acc on ViDoRe V3)

Method C.S. Fin. H.R. Ind. Phar.
MinerU-Popo 84.4 49.5 66.8 58.7 71.6
Raw RAG 82.3 48.7 63.2 60.4 64.4
Visual RAG 80.7 58.4 64.8 59.7 67.6

⚙️ Setup

  1. Prepare Environment
conda create -n popo python=3.10
conda activate popo
pip install -r requirements.txt
  1. Download Model

Download the MinerU-Popo post-processing model:

hf download DreamEternal/MinerU-Popo --local-dir models/Mineru-Popo
  1. Model Configuration

In the Configuration, for transformer inference, edit the environment POPO_MODEL_PATH. For vllm inference, edit the url and key in function popo_generate.

For enrichment and question answering, further edit the url and key in qwen_generate and gpt_generate.

💻 Usage

The post-processing pipeline takes page-level parsing results from OCR/layout systems, normalizes them into a unified schema, runs MinerU-Popo inference, and finally builds document trees.

Step 1: Prepare OCR/Layout Outputs

Run your preferred page-level parser first, such as MinerU, MonkeyOCR, Dolphin, PaddleOCR-VL, or GLM-OCR. Place each model's output under:

post-process/<model_name>/

For example:

post-process/mineru/
post-process/monkeyocr/
post-process/PaddleOCR-VL-1.5/
post-process/dolphin/
post-process/glm-ocr/

Step 2: Normalize Labels

Convert raw model-specific labels and bounding boxes into the unified MinerU-Popo input format:

bash scripts/run_label_normalization.sh

The normalized outputs are written to:

outputs/label_normalization/<model_name>/

Step 3: Run MinerU-Popo Inference

Run MinerU-Popo on the normalized labels:

bash scripts/run_inference.sh

The inference outputs are written to:

outputs/inference/<model_name>/

Step 4: Build Document Trees

Build structured document trees from the inference outputs:

bash scripts/build_tree.sh

The final tree outputs and text previews are written to:

outputs/build_tree/<model_name>/
outputs/build_tree_txt/<model_name>/

Example tree outputs are provided in:

output_cases/

🙏 Acknowledgements

  • MinerU and other OCR system (MonkeyOCR, Dolphin, PaddleOCR, GLM-OCR) for page-level parsing.
  • ViDoRe V3 and MMDA as benchmarks.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors