mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Anwen Hu, Haiyang Xu†, Jiabo Ye, Ming Yan†, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou

† Corresponding Author

Data: DocStruct4M 🤗 DocReason25K 🤗 DocDownstream 🤗 DocLocal4K 🤗
Models: DocOwl1.5-stage1 🤗 DocOwl1.5 🤗 DocOwl1.5-Chat 🤗 DocOwl1.5-Omni 🤗



  • Support struct-aware document parsing, table to markdown, chart to markdown.

  • Support multi-grained text recognition and text grounding.

  • Support question answering with simple phrases or detailed explanations.

  • Open Source

    • ✅ Training Data: DocStruct4M, DocReason25K, DocDownstream-1.0
    • ✅ Multi-grained Text Localization Evaluation set: DocLocal4K
    • ✅ Model: DocOwl1.5-stage1, DocOwl1.5, DocOwl1.5-Chat, DocOwl1.5-Omni
    • ✅ Source code of model inference and evaluation.
    • ✅ Online Demo on ModelScope and HuggingFace.
    • ✅ Source code of launching a local demo.
    • ✅ Training code.


🤗 HuggingFace Space

ModelScope Space

Training and Evaluation Datasets

Dataset download links:
  • DocStruct4M
    • HuggingFace: mPLUG/DocStruct4M
    • ModelScope: iic/DocStruct4M
  • DocDownstream-1.0
    • HuggingFace: mPLUG/DocDownstream-1.0
    • ModelScope: iic/DocDownstream-1.0
  • DocReason25K
    • HuggingFace: mPLUG/DocReason25K
    • ModelScope: iic/DocReason25K
  • DocLocal4K
    • HuggingFace: mPLUG/DocLocal4K
    • ModelScope: iic/DocLocal4K

    DocStruct4M

    DocStruct4M is a training set for Unified Structure Learning, covering images of documents, webpages, tables, charts and natural images. It consists of ~3M samples for Struct-aware Parsing tasks and ~1M samples for Multi-grained Text Localization tasks.

    Download the DocStruct4M dataset from HuggingFace (mPLUG/DocStruct4M). The training images (~311 GB) are split into 8 files; run the following commands to prepare the training and validation images.

    cat partial-imgs* > imgs.tar.gz
    tar -zxvf imgs.tar.gz
    tar -zxvf val_imgs.tar.gz

    The dataset is organized as follows:

    ├── imgs
    ├── val_imgs
    ├── multi_grained_text_localization.jsonl
    ├── struct_aware_parse.jsonl
    ├── val.jsonl

    The ./imgs and ./val_imgs directories contain the images for the training and validation samples, respectively.
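Where `cat` and `tar` are unavailable (for example on Windows), the preparation commands above can be approximated in Python. This is a minimal sketch and not part of the repository: the function names `merge_parts` and `extract_archive` are hypothetical, and it assumes the split files sort into the correct order by name, as the shell glob `partial-imgs*` does.

```python
import glob
import shutil
import tarfile


def merge_parts(pattern: str, out_path: str) -> list[str]:
    """Concatenate all files matching `pattern`, sorted by name, into `out_path`."""
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts are never fully loaded into memory.
                shutil.copyfileobj(src, out)
    return parts


def extract_archive(archive_path: str, dest_dir: str = ".") -> None:
    """Extract a gzip-compressed tar archive into `dest_dir`."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
```

Calling `merge_parts("partial-imgs*", "imgs.tar.gz")` followed by `extract_archive("imgs.tar.gz")` mirrors the first two shell commands. Only extract archives from a source you trust.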


    DocDownstream-1.0

    DocDownstream-1.0 combines 10 text-rich image understanding benchmarks: DocVQA, InfographicsVQA, DeepForm, KleisterCharity, WikiTableQuestions, TabFact, ChartQA, TextCaps, TextVQA and VisualMRC. Together they cover Information Extraction, Visual Question Answering, Natural Language Inference and Image Captioning. All tasks are unified in the form of Visual Question Answering.

    Download the DocDownstream-1.0 dataset from HuggingFace (mPLUG/DocDownstream-1.0). The images (~70 GB) are split into 2 files; run the following commands to prepare the images.

    cat partial-imgs* > imgs.tar.gz
    tar -zxvf imgs.tar.gz

    The dataset is organized as follows:

    ├── meta
    ├── test
    ├── imgs
    ├── train.jsonl
    ├── val.jsonl

    The ./imgs directory contains the images for the training, validation and test samples. train.jsonl and val.jsonl combine the samples of all 10 datasets for training and validation; train.jsonl holds ~570K samples. The ./test directory contains the test files for each dataset. The ./meta directory contains the meta files used for evaluation.
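To see how the combined training file is distributed over the 10 benchmarks, the records can be tallied by dataset. The sketch below is illustrative only: the helper `count_by_field` is hypothetical, and it assumes each line of train.jsonl is a JSON object carrying a "dataset_name" field, as in the sample shown in the Model Training section.

```python
import json
from collections import Counter


def count_by_field(jsonl_path: str, field: str = "dataset_name") -> Counter:
    """Count JSONL records by the value of `field`.

    Records that lack the field are counted under '<missing>'.
    """
    counts: Counter = Counter()
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            counts[record.get(field, "<missing>")] += 1
    return counts
```

For example, `count_by_field("train.jsonl").most_common()` lists the datasets from largest to smallest.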


    DocReason25K

    DocReason25K is an instruction-tuning set with detailed explanations for Visual Document Understanding. It is built from training samples of DocVQA, InfographicsVQA, WikiTableQuestions, VisualMRC, ChartQA and TextVQA. The detailed explanations are generated by GPT3.5/GPT4V and then filtered against the manually annotated simple answers.

    Download the DocReason25K dataset from HuggingFace (mPLUG/DocReason25K). The dataset is organized as follows:

    ├── imgs
    ├── detailed_explanation.jsonl


    DocLocal4K

    DocLocal4K is an evaluation set for Multi-grained Text Localization, covering both text recognition and text grounding tasks.

    Download the DocLocal4K dataset from HuggingFace (mPLUG/DocLocal4K). The dataset is organized as follows:

    ├── imgs
    ├── text_grounding.jsonl
    ├── text_recognition.jsonl


    Model Card

    Models, download links and abilities:
  • DocOwl1.5-stage1
    • Download links: 🤗 mPLUG/DocOwl1.5-stage1; ModelScope: iic/DocOwl1.5-stage1
    • Abilities: document/webpage parsing, table to markdown, chart to markdown, natural image parsing, multi-grained text recognition, multi-grained text grounding
  • DocOwl1.5
    • Download links: 🤗 mPLUG/DocOwl1.5; ModelScope: iic/DocOwl1.5
    • Abilities: VQA with concise answers, information extraction, image captioning, natural language inference
  • DocOwl1.5-Chat
    • Download links: 🤗 mPLUG/DocOwl1.5-Chat; ModelScope: iic/DocOwl1.5-Chat
    • Abilities: VQA with detailed explanations, VQA with concise answers, information extraction, image captioning, natural language inference
  • DocOwl1.5-Omni
    • Download links: 🤗 mPLUG/DocOwl1.5-Omni; ModelScope: iic/DocOwl1.5-Omni
    • Abilities: document/webpage parsing, table to markdown, chart to markdown, natural image parsing, multi-grained text recognition, multi-grained text grounding, VQA with detailed explanations, VQA with concise answers, information extraction, image captioning, natural language inference

    Model Inference

    Prepare the Python environment in the same way as for mPLUG-Owl2. Versions of some important packages: transformers==4.31.0

    • DocOwl1.5-stage1 inference examples
    from docowl_infer import DocOwlInfer

    # Placeholders (not part of the original snippet): point these at your own files.
    model_path = 'path/to/DocOwl1.5-stage1'  # local checkpoint directory
    image = 'path/to/image.png'              # input image

    # DocOwl1.5-stage1 is loaded with add_global_img=False.
    docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=False)
    print('load model from ', model_path)
    # document/webpage parsing
    query = 'Recognize text in the image.'
    answer = docowl.inference(image, query)
    # table/chart to markdown
    query = 'Convert the picture to Markdown syntax.'
    answer = docowl.inference(image, query)
    # natural image parsing
    query = 'Provide a description of the image content and text.'
    answer = docowl.inference(image, query)
    • DocOwl1.5-Chat inference examples
    from docowl_infer import DocOwlInfer

    # Placeholders (not part of the original snippet): point these at your own files.
    model_path = 'path/to/DocOwl1.5-Chat'  # local checkpoint directory
    image = 'path/to/image.png'            # input image

    # DocOwl1.5-Chat is loaded with add_global_img=True.
    docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
    print('load model from ', model_path)
    # VQA with concise phrases
    query = 'What is the Compound Annual Growth Rate (CAGR) for total assets?'
    answer = docowl.inference(image, query)
    # VQA with detailed explanation
    query = 'What is the Compound Annual Growth Rate (CAGR) for total assets? Answer the question with detailed explanation.'
    answer = docowl.inference(image, query)
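In the DocOwl1.5-Chat examples, the two answer styles differ only in the sentence appended to the question. A small helper can make that switch explicit. This is a sketch under that assumption; `build_query` and `DETAIL_SUFFIX` are hypothetical names, not part of the repository.

```python
# Sentence that requests a detailed explanation, copied from the Chat example.
DETAIL_SUFFIX = " Answer the question with detailed explanation."


def build_query(question: str, detailed: bool = False) -> str:
    """Return the question as-is, or with the detailed-explanation request appended."""
    question = question.strip()
    return question + DETAIL_SUFFIX if detailed else question
```

The result can be passed as the `query` argument of `docowl.inference(image, query)`.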

    Model Evaluation

    Prepare the environment for evaluation as follows:

    pip install textdistance
    pip install editdistance
    pip install pycocoevalcap

    Evaluate DocOwl1.5/DocOwl1.5-Chat on 10 downstream tasks:

    python --model_path $MODEL_PATH --dataset $DATASET --downstream_dir $DOWNSTREAM_DIR_PATH --save_dir $SAVE_DIR

    Note: $DATASET should be chosen from [DocVQA, InfographicsVQA, WikiTableQuestions, DeepForm, KleisterCharity, TabFact, ChartQA, TextVQA, TextCaps, VisualMRC]. $DOWNSTREAM_DIR_PATH is the local path of mPLUG/DocDownstream-1.0.
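To evaluate on all 10 downstream tasks in one go, the command above can be generated once per dataset. The sketch below only builds the argument lists. `build_eval_commands` is a hypothetical helper, and `eval_script` is a placeholder for the repository's evaluation script, which you must supply yourself.

```python
# The 10 valid values of $DATASET listed in the note above.
DATASETS = [
    "DocVQA", "InfographicsVQA", "WikiTableQuestions", "DeepForm",
    "KleisterCharity", "TabFact", "ChartQA", "TextVQA", "TextCaps", "VisualMRC",
]


def build_eval_commands(eval_script: str, model_path: str,
                        downstream_dir: str, save_dir: str) -> list[list[str]]:
    """Build one evaluation command (as an argv list) per downstream dataset."""
    return [
        [
            "python", eval_script,
            "--model_path", model_path,
            "--dataset", dataset,
            "--downstream_dir", downstream_dir,
            "--save_dir", save_dir,
        ]
        for dataset in DATASETS
    ]
```

Each returned list can be run with `subprocess.run(cmd, check=True)`. Running the datasets sequentially keeps only one model instance on the GPU at a time.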

    Evaluate DocOwl1.5-stage1 on DocLocal4K:

    python --model_path $MODEL_PATH --task $TASK --doclocal4k_dir $DOCLOCAL4K_DIR_PATH --save_dir $SAVE_DIR

    Note: $TASK should be chosen from [text_grounding, text_recognition]. $DOCLOCAL4K_DIR_PATH is the local path of mPLUG/DocLocal4K.

    Model Training

    You can further fine-tune your own models starting from the DocOwl 1.5 models.

    1. Prepare a training jsonl file in which each training sample follows this format:

       {"image": ["./imgs/DUE_Benchmark/DocVQA/pngs/xnbl0037_1.png"], "messages": [{"role": "user", "content": "<|image|>what is the date mentioned in this letter?"}, {"role": "assistant", "content": "1/8/93"}], "task_name": "qa_sft", "dataset_name": "DocVQA"}

       Note: make sure the number of <|image|> tokens equals the number of input images.

    2. Modify parameters in ./scripts/ or ./scripts/ according to your personal needs. ./scripts/ provides an example of finetuning DocOwl1.5-stage1 on DocDownstream-1.0. ./scripts/ provides an example of finetuning DocOwl1.5-stage1 with LoRA.

    3. Run bash ./scripts/ or bash ./scripts/
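The constraint in step 1, that the number of <|image|> tokens must equal the number of input images, can be checked before training starts. The following is a minimal sketch that assumes the sample layout shown in step 1. The helper names are hypothetical, and tokens are counted across all messages, which is an assumption about how the training code reads them.

```python
IMAGE_TOKEN = "<|image|>"


def count_image_tokens(sample: dict) -> int:
    """Count occurrences of the image token across all messages of a sample."""
    return sum(msg["content"].count(IMAGE_TOKEN) for msg in sample["messages"])


def validate_sample(sample: dict) -> None:
    """Raise ValueError if the image count and image-token count disagree."""
    n_images = len(sample["image"])
    n_tokens = count_image_tokens(sample)
    if n_images != n_tokens:
        raise ValueError(
            f"{n_images} image(s) but {n_tokens} {IMAGE_TOKEN} token(s) in sample"
        )
```

Applying `validate_sample(json.loads(line))` to every line of the training jsonl file surfaces malformed samples early instead of partway through a training run.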

    Note: DocOwl 1.5 was trained with Megatron. For the open-source release we additionally built training code based on DeepSpeed. We have verified that the training scripts run well on A100-80G with zero2, but we run into deadlocks when using zero3. If you have ideas on how to fix the zero3 deadlock, we would greatly appreciate your sharing them!

    Local Demo

    Run the following command to launch a local demo powered by DocOwl1.5-Omni:

    python --model-source modelscope

    Note: The demo is built on gradio==3.27.0. If you must use gradio==4.26.0, you can refer to our code on the HuggingFace Space by git clone . You can also change model-source to huggingface or local and specify the model-path. We have verified that the local demo works on A100-80G or V100-32G.


    If you find this work useful, consider giving this repository a star and citing our paper as follows:

      title={mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding},
      author={Hu, Anwen and Xu, Haiyang and Ye, Jiabo and Yan, Ming and Zhang, Liang and Zhang, Bo and Li, Chen and Zhang, Ji and Jin, Qin and Huang, Fei and others},
      journal={arXiv preprint arXiv:2403.12895},