TextHawk: 🥇 LVLM with 16x Compression Ratio


Base Models

TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

TextHawk: Efficient Fine-Grained Perception of Multimodal Large Language Models

GUI Agents

UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents

Introduction

The TextHawk series represents a cutting-edge family of Large Vision-Language Models (LVLMs) designed for highly efficient fine-grained perception. Notably, TextHawk sets a milestone as the first LVLM to achieve a 16x token compression ratio. This is made possible through the integration of four key components:

  • Scalable Positional Embeddings (SPEs)
  • Query Proposal Network (QPN)
  • ReSampling and ReArrangement (ReSA)
  • Multi-Level Cross-Attention (MLCA)

architecture
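
To make the 16x compression idea concrete, the toy module below sketches one way resampling and rearrangement could be combined. This is an illustrative sketch, not the official TextHawk implementation: it assumes resampling is a cross-attention with 4x fewer learnable queries and that rearrangement folds each 2x2 neighborhood of tokens into the channel dimension, so a 16x16 grid of visual tokens (256) shrinks to 16. The class name `ReSASketch`, the grid size, and the hidden width are all assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn


class ReSASketch(nn.Module):
    """Toy 16x visual-token compressor: cross-attention resampling
    (4x fewer tokens) followed by a 2x2 rearrangement (another 4x)."""

    def __init__(self, dim: int = 768, num_heads: int = 8, grid: int = 16):
        super().__init__()
        assert grid % 4 == 0, "grid must be divisible by 4 for 16x compression"
        self.grid = grid
        # Resampling: learnable queries laid out on a (grid/2 x grid/2) map
        # attend to the full (grid x grid) token map -> 4x reduction.
        self.queries = nn.Parameter(torch.randn(1, (grid // 2) ** 2, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Rearrangement: fold each 2x2 neighborhood into the channel axis,
        # then project back to the model width -> another 4x reduction.
        self.proj = nn.Linear(4 * dim, dim)

    def forward(self, vis: torch.Tensor) -> torch.Tensor:
        # vis: (B, grid*grid, dim) patch tokens from the vision encoder.
        b, _, d = vis.shape
        q = self.queries.expand(b, -1, -1)
        sampled, _ = self.attn(q, vis, vis)              # (B, (grid/2)^2, dim)
        g = self.grid // 2
        x = sampled.view(b, g // 2, 2, g // 2, 2, d)     # split 8x8 map into 2x2 blocks
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (g // 2) ** 2, 4 * d)
        return self.proj(x)                              # (B, grid*grid/16, dim)


if __name__ == "__main__":
    tokens = torch.randn(2, 16 * 16, 768)   # 256 visual tokens per sub-image
    compressed = ReSASketch()(tokens)
    print(compressed.shape)                 # torch.Size([2, 16, 768]) -> 16x fewer tokens
```

In the real model, the SPEs, QPN, and MLCA components refine how the queries are positioned, proposed, and attended across encoder levels; the sketch only conveys where the 16x factor comes from.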

Building on the same architecture, TextHawk2 enhances performance by leveraging greater data diversity and reinforcing the visual encoder. This iteration achieves state-of-the-art results across multiple benchmarks, excelling in tasks related to general multimodal understanding, Optical Character Recognition (OCR), and visual grounding.

For instance, TextHawk2 delivers impressive metrics such as 78.4% accuracy on OCRBench, 81.4% accuracy on ChartQA, 89.6% ANLS on DocVQA, and 88.1% accuracy on RefCOCOg-test.

compression

The TextHawk series can compress several times more words, displayed on a small image where each character measures under 8 pixels, into just a few tokens, and still recover them accurately. It is reminiscent of the futuristic gadgets in the Doraemon anime.

examples

DocGemini

We create DocGemini, a new instruction-tuning dataset for document-oriented tasks, by enriching multimodal document data with Gemini Pro. Each data sample contains (see the hypothetical record sketched after the download table):

  • A brief summary of the document topics.
  • Short QA pairs, up to 10.
  • Insights behind each answer.
  • [Optional] An imaginary conversation between two researchers.

DocGemini consists of 30K images and 195K QA pairs with insights.

| Dataset | QA | Conversation |
| --- | --- | --- |
| DocVQA | link | link |
| ChartQA | link | link |
| InfoVQA | link | link |

Note: Alternatively, you can produce data on your own using the scripts we provide.
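
For concreteness, a DocGemini record with the fields listed above might look like the following. This is a hypothetical sketch: the field names, file path, and values are illustrative only and do not reflect the dataset's actual schema or contents.

```python
# Hypothetical DocGemini record; field names and values are illustrative only.
sample = {
    "image": "images/doc_00001.png",           # placeholder path to a document image
    "summary": "A purchase order listing laboratory supplies and unit prices.",
    "qa_pairs": [                               # up to 10 short QA pairs
        {
            "question": "What is the order date?",
            "answer": "12 March 1987",
            "insight": "The date is printed in the header next to the PO number.",
        },
    ],
    "conversation": None,                       # optional dialogue between two researchers
}
```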

Benchmarks

ocr

grounding

proprietary

TextHawk
| Model | ViT (Params.) | MME perception | MMB dev | SEED image | GQA | DocVQA | ChartQA | InfoVQA | TabFact | WTQ | RefCOCO val | RefCOCO test-A | RefCOCO test-B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\text{Donut}$ | $\text{Swin-B}$ (0.1B) | - | - | - | - | 67.5 | 41.8 | 11.6 | 54.6 | 18.8 | - | - | - |
| $\text{Pix2Struct}$ | - | - | - | - | - | 76.6 | 58.6 | 40.0 | - | - | - | - | - |
| $\text{InternLM-XC}$ | $\text{EVA-G}$ (1B) | 1528.4 | 74.8 | 66.1 | - | - | - | - | - | - | - | - | - |
| $\text{LLaVA-1.5-7B}$ | $\text{CLIP-L}$ (0.3B) | 1510.7 | 65.2 | - | 62.0 | - | - | - | - | - | - | - | - |
| $\text{Shikra-7B}$ | $\text{CLIP-L}$ (0.3B) | - | 58.8 | - | - | - | - | - | - | - | 87.0 | 91.1 | 81.8 |
| $\text{Qwen-VL-Chat}$ | $\text{CLIP-G}$ (2B) | 1487.6 | 60.6 | 65.4 | 57.5 | 62.6 | 66.3 | - | - | - | 88.6 | 92.3 | 84.5 |
| $\text{Monkey}$ | $\text{CLIP-G}$ (2B) | - | 59.3 | - | 60.7 | 66.5 | 65.1 | 36.1 | - | 25.3 | - | - | - |
| $\text{UReader}$ | $\text{CLIP-L}$ (0.3B) | - | - | - | - | 65.4 | 59.3 | 42.2 | 67.6 | 29.4 | - | - | - |
| $\text{TextMonkey}$ | $\text{CLIP-G}$ (2B) | - | - | - | - | 73.0 | 66.9 | - | - | 31.9 | - | - | - |
| $\textbf{TextHawk}^*$ | $\text{SigLIP-SO}$ (0.4B) | 1520.9 | 73.0 | 69.2 | 64.7 | 73.6 | 64.0 | 47.3 | 70.7 | 33.5 | 87.3 | 90.9 | 83.3 |
| $\textbf{TextHawk}$ | $\text{SigLIP-SO}$ (0.4B) | 1500.0 | 74.6 | 69.2 | 64.6 | 76.4 | 66.6 | 50.6 | 71.1 | 34.7 | 87.2 | 90.8 | 82.5 |

Note: $\textbf{TextHawk}^*$ is fine-tuned without the DocGemini data.

Visualization

markdown

reg

BibTeX

@article{yu24texthawk2,
  author       = {Ya{-}Qi Yu and Minghui Liao and Jiwen Zhang and Jihao Wu},
  title        = {TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens},
  journal      = {CoRR},
  volume       = {abs/2410.05261},
  year         = {2024}
}
@article{yu24texthawk,
  author       = {Ya{-}Qi Yu and Minghui Liao and Jihao Wu and Yongxin Liao and Xiaoyu Zheng and Wei Zeng},
  title        = {TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models},
  journal      = {CoRR},
  volume       = {abs/2404.09204},
  year         = {2024}
}
@article{zhang24uihawk,
  title        = {{UI-Hawk}: Unleashing the Screen Stream Understanding for GUI Agents},
  author       = {Jiwen Zhang and Yaqi Yu and Minghui Liao and Wentao Li and Jihao Wu and Zhongyu Wei},
  journal      = {Preprints},
  volume       = {manuscript/202408.2137},
  year         = {2024}
}
