Skip to content

zyinan99/DOGE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 

Repository files navigation

DOGE

Towards Versatile Visual Document Grounding and Referring

Yinan Zhou*, Yuxin Chen*, Haokun Lin,Shuyu Yang, Li Zhu, Zhongang Qi‡, Chen Ma‡, Ying Shan

*Equal Contribution †Project Lead ‡Corresponding Authors

 

Abstract

In recent years, Multimodal Large Language Models (MLLMs) have increasingly emphasized grounding and referring capabilities to achieve detailed understanding and flexible user interaction. However, in the realm of visual document understanding, these capabilities lag behind due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and rEferring data engine (DOGE-Engine), which produces two types of high-quality fine-grained document data: multi-granular parsing data for enhancing fundamental text localization and recognition capabilities; and instruction-tuning data to activate MLLM's grounding and referring capabilities during dialogue and reasoning. Additionally, using our engine, we construct DOGE-Bench, which encompasses 7 grounding and referring tasks across 3 document types (chart, poster, PDF document), providing comprehensive evaluations for fine-grained document understanding. Furthermore, leveraging the data generated by our engine, we develop a strong baseline model, DOGE. This pioneering MLLM is capable of accurately referring and grounding texts at multiple granularities within document images.

Code

To be released.

Dataset

To be released.

Video

Watch the introduction video here!

Demo

To be released.

Citation

@misc{zhou2024dogeversatilevisualdocument,
      title={DOGE: Towards Versatile Visual Document Grounding and Referring}, 
      author={Yinan Zhou and Yuxin Chen and Haokun Lin and Shuyu Yang and Li Zhu and Zhongang Qi and Chen Ma and Ying Shan},
      year={2024},
      eprint={2411.17125},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17125}, 
}
}

Please refer to our license file for more details.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published