TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents
The TextHawk series is a family of Large Vision-Language Models (LVLMs) designed for highly efficient fine-grained perception. Notably, TextHawk is the first LVLM to achieve a 16x visual token compression ratio. This is made possible by four key components (a rough sketch of the compression idea follows the list):
- Scalable Positional Embeddings (SPEs)
- Query Proposal Network (QPN)
- ReSampling and ReArrangement (ReSA)
- Multi-Level Cross-Attention (MLCA)
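To make the 16x figure concrete, below is a minimal PyTorch-style sketch of the ReSA idea: a resampling step in which learnable queries cross-attend to local windows of visual tokens, followed by a rearrangement step that concatenates neighboring compressed tokens along the channel axis. The module names, the 4x-resampling-then-4x-rearrangement split, and all dimensions are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the ReSampling and ReArrangement (ReSA) idea.
# Assumptions (not the released code): 4x compression from resampling,
# another 4x from rearrangement, and a single learnable query per window.
import torch
import torch.nn as nn

class ReSA(nn.Module):
    def __init__(self, dim=1024, num_heads=8, resample_ratio=4, rearrange_ratio=4):
        super().__init__()
        # ReSampling: a learnable query summarizes each local window of visual tokens.
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.resample_ratio = resample_ratio
        # ReArrangement: neighboring compressed tokens are concatenated along the
        # channel axis and projected back to the original width.
        self.rearrange_ratio = rearrange_ratio
        self.proj = nn.Linear(dim * rearrange_ratio, dim)

    def forward(self, x):
        # x: (batch, num_tokens, dim) visual tokens from the ViT
        b, n, d = x.shape
        # ReSampling: one query cross-attends to each window of `resample_ratio` tokens.
        windows = x.view(b * n // self.resample_ratio, self.resample_ratio, d)
        q = self.query.expand(windows.size(0), -1, -1)
        sampled, _ = self.cross_attn(q, windows, windows)        # (b*n/4, 1, d)
        sampled = sampled.view(b, n // self.resample_ratio, d)   # (b, n/4, d)
        # ReArrangement: merge `rearrange_ratio` adjacent tokens by channel concatenation.
        m = sampled.size(1) // self.rearrange_ratio
        merged = sampled.view(b, m, d * self.rearrange_ratio)    # (b, n/16, 4d)
        return self.proj(merged)                                  # (b, n/16, d)

tokens = torch.randn(2, 256, 1024)   # e.g. a 16x16 grid of ViT patch tokens
print(ReSA()(tokens).shape)          # torch.Size([2, 16, 1024]): 16x fewer tokens
```

Under these assumptions, 256 patch tokens shrink to 16 tokens of the same channel width, which is where the 16x reduction comes from; the actual components (SPEs, QPN, MLCA) refine how the queries and multi-level features are produced.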
Building on the same architecture, TextHawk2 enhances performance by leveraging greater data diversity and reinforcing the visual encoder. This iteration achieves state-of-the-art results across multiple benchmarks, excelling in tasks related to general multimodal understanding, Optical Character Recognition (OCR), and visual grounding.
For instance, TextHawk2 achieves 78.4% accuracy on OCRBench, 81.4% accuracy on ChartQA, 89.6% ANLS on DocVQA, and 88.1% acc@0.5 on RefCOCOg-test.
The TextHawk series can compress several times more words, rendered on a small image with each character under 8 pixels, into just a few tokens and still recover them accurately, reminiscent of the futuristic gadgets in the Doraemon anime.
We create DocGemini, a new instruction-tuning dataset for document-oriented tasks, by enriching multimodal document data with Gemini Pro. Each data sample contains:
- A brief summary of the document topics.
- Up to 10 short QA pairs.
- Insights behind each answer.
- [Optional] An imaginary conversation between two researchers.
DocGemini consists of 30K images and 195K QA pairs with insights; a sample layout is sketched below.
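For concreteness, a single DocGemini record might be laid out as follows; the field names, path, and values are hypothetical and do not reflect the released schema.

```python
# Hypothetical layout of one DocGemini sample; all field names and values are
# illustrative assumptions, not the released schema.
sample = {
    "image": "images/example_report.png",   # source document image (made-up path)
    "summary": "A quarterly sales report comparing revenue across regions.",
    "qa_pairs": [                           # up to 10 short QA pairs per image
        {
            "question": "Which region had the highest revenue?",
            "answer": "North America",
            "insight": "The tallest bar in the chart belongs to North America.",
        },
    ],
    "conversation": None,                   # optional imaginary dialogue between two researchers
}
```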
| Dataset | QA | Conversation |
|---|---|---|
| DocVQA | link | link |
| ChartQA | link | link |
| InfoVQA | link | link |
Note: Alternatively, you can produce data on your own using the scripts we provide.
TextHawk
| Model | ViT (Params.) | MME perception | MMB dev | SEED image | GQA | DocVQA | ChartQA | InfoVQA | TabFact | WTQ | RefCOCO val | RefCOCO test-A | RefCOCO test-B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | (0.1B) | - | - | - | - | 67.5 | 41.8 | 11.6 | 54.6 | 18.8 | - | - | - |
| | - | - | - | - | - | 76.6 | 58.6 | 40.0 | - | - | - | - | - |
| | (1B) | 1528.4 | 74.8 | 66.1 | - | - | - | - | - | - | - | - | - |
| | (0.3B) | 1510.7 | 65.2 | - | 62.0 | - | - | - | - | - | - | - | - |
| | (0.3B) | - | 58.8 | - | - | - | - | - | - | - | 87.0 | 91.1 | 81.8 |
| | (2B) | 1487.6 | 60.6 | 65.4 | 57.5 | 62.6 | 66.3 | - | - | - | 88.6 | 92.3 | 84.5 |
| | (2B) | - | 59.3 | - | 60.7 | 66.5 | 65.1 | 36.1 | - | 25.3 | - | - | - |
| | (0.3B) | - | - | - | - | 65.4 | 59.3 | 42.2 | 67.6 | 29.4 | - | - | - |
| | (2B) | - | - | - | - | 73.0 | 66.9 | - | - | 31.9 | - | - | - |
| $\textbf{TextHawk}^*$ | (0.4B) | 1520.9 | 73.0 | 69.2 | 64.7 | 73.6 | 64.0 | 47.3 | 70.7 | 33.5 | 87.3 | 90.9 | 83.3 |
| $\textbf{TextHawk}$ | (0.4B) | 1500.0 | 74.6 | 69.2 | 64.6 | 76.4 | 66.6 | 50.6 | 71.1 | 34.7 | 87.2 | 90.8 | 82.5 |
Note: $\textbf{TextHawk}^*$ is fine-tuned without the DocGemini data.
@article{yu24texthawk2,
author = {Ya{-}Qi Yu and Minghui Liao and Jiwen Zhang and Jihao Wu},
title = {TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens},
journal = {CoRR},
volume = {abs/2410.05261},
year = {2024}
}
@article{yu24texthawk,
author = {Ya{-}Qi Yu and Minghui Liao and Jihao Wu and Yongxin Liao and Xiaoyu Zheng and Wei Zeng},
title = {TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models},
journal = {CoRR},
volume = {abs/2404.09204},
year = {2024}
}
@article{zhang24uihawk,
title = {{UI-Hawk}: Unleashing the Screen Stream Understanding for GUI Agents},
author = {Jiwen Zhang and Yaqi Yu and Minghui Liao and Wentao Li and Jihao Wu and Zhongyu Wei},
journal = {Preprints},
volume = {manuscript/202408.2137},
year = {2024}
}