- Support Multi-page Text Lookup and Multi-page Text Parsing.
- Support Multi-page Question Answering, with answers given as simple phrases or as detailed explanations accompanied by evidence pages.
- Support Text-rich Video Understanding.
Open Source
- ✅ Training Data: MP-DocStruct1M, MP-DocReason51K, DocDownstream-2.0, DocGenome12K
- ✅ Model: DocOwl2
- ✅ Source code of model inference and evaluation.
- Model: DocOwl2-stage1, DocOwl2-stage2
- Online Demo on ModelScope and HuggingFace.
- Source code for launching a local demo.
- Training code.
| Dataset | Download Link |
|---|---|
| MP-DocStruct1M | |
| DocDownstream-2.0 | |
| MP-DocReason51K | |
| DocGenome12K | |
| Model | Download Link | Abilities |
|---|---|---|
| DocOwl2 | | |
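The download links above are omitted here. As a minimal sketch, the artifacts can be fetched with `huggingface_hub`: the model repo id `mPLUG/DocOwl2` matches the inference example below, and `mPLUG/DocDownstream-2.0` is named in the evaluation notes; the exact hosting of the other datasets is an assumption.

```python
# Sketch of downloading the released checkpoint and a benchmark dataset.
# Repo ids follow the mPLUG organization naming used elsewhere in this README.
from huggingface_hub import snapshot_download

# Download the DocOwl2 checkpoint (weights, config, remote processor code).
model_dir = snapshot_download(repo_id='mPLUG/DocOwl2')

# Download a benchmark dataset; repo_type='dataset' is required for dataset repos.
data_dir = snapshot_download(repo_id='mPLUG/DocDownstream-2.0', repo_type='dataset')

print(model_dir, data_dir)
```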
```python
import torch
from transformers import AutoTokenizer, AutoModel


class DocOwlInfer():
    def __init__(self, ckpt_path):
        self.tokenizer = AutoTokenizer.from_pretrained(ckpt_path, use_fast=False)
        self.model = AutoModel.from_pretrained(ckpt_path, trust_remote_code=True, low_cpu_mem_usage=True, torch_dtype=torch.float16, device_map='auto')
        # Configure the processor shipped with the checkpoint: 504x504 base
        # crops with the 'grid_12' anchor set for high-resolution pages.
        self.model.init_processor(tokenizer=self.tokenizer, basic_image_size=504, crop_anchors='grid_12')

    def inference(self, images, query):
        # One <|image|> placeholder per page, followed by the text query.
        messages = [{'role': 'USER', 'content': '<|image|>' * len(images) + query}]
        answer = self.model.chat(messages=messages, images=images, tokenizer=self.tokenizer)
        return answer


docowl = DocOwlInfer(ckpt_path='mPLUG/DocOwl2')

images = [
    './examples/docowl2_page0.png',
    './examples/docowl2_page1.png',
    './examples/docowl2_page2.png',
    './examples/docowl2_page3.png',
    './examples/docowl2_page4.png',
    './examples/docowl2_page5.png',
]

# Document-level question over all six pages.
answer = docowl.inference(images, query='what is this paper about? provide detailed information.')
print(answer)

# Page-level question; the model grounds its answer on the relevant page.
answer = docowl.inference(images, query='what is the third page about? provide detailed information.')
print(answer)
```
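The same free-form chat interface covers the text-lookup and text-parsing abilities listed at the top of this README; the prompt wording below is an illustrative assumption, not a fixed API.

```python
# Illustrative parsing query; the phrasing is an assumption -- DocOwl2
# accepts free-form natural-language instructions.
answer = docowl.inference(images, query='Recognize the text of the second page.')
print(answer)
```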
Prepare the environment for evaluation as follows:

```bash
pip install textdistance
pip install editdistance
pip install pycocoevalcap
```
Evaluate DocOwl2 on 10 single-image tasks, 2 multi-page tasks, and 1 video task:

```bash
python docowl_benchmark_evaluate.py --model_path $MODEL_PATH --dataset $DATASET --downstream_dir $DOWNSTREAM_DIR_PATH --save_dir $SAVE_DIR --split $split
```
Note:

- For single-image evaluation, `$DATASET` should be chosen from [DocVQA, InfographicsVQA, WikiTableQuestions, DeepForm, KleisterCharity, TabFact, ChartQA, TextVQA, TextCaps, VisualMRC], `$DOWNSTREAM_DIR_PATH` is the local path of mPLUG/DocDownstream-1.0, and `$split==test`.
- For multi-page evaluation and video evaluation, `$DATASET` should be chosen from [MP-DocVQA, DUDE, NewsVideoQA], `$DOWNSTREAM_DIR_PATH` is the local path of mPLUG/DocDownstream-2.0, and `$split==val`. You can also set `$split==test` and submit the file named with the suffix `_submission.json` to the official evaluation website.
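For concreteness, here is a sketch of two invocations following the notes above; the local directory and save paths are placeholders, not paths shipped with the repo.

```bash
# Single-image task: DocVQA on the test split (DocDownstream-1.0).
python docowl_benchmark_evaluate.py \
    --model_path mPLUG/DocOwl2 \
    --dataset DocVQA \
    --downstream_dir ./DocDownstream-1.0 \
    --save_dir ./eval_results \
    --split test

# Multi-page task: MP-DocVQA on the val split (DocDownstream-2.0).
python docowl_benchmark_evaluate.py \
    --model_path mPLUG/DocOwl2 \
    --dataset MP-DocVQA \
    --downstream_dir ./DocDownstream-2.0 \
    --save_dir ./eval_results \
    --split val
```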