[REQUEST] Support new SOTA vision model: Qwen 2.5 VL (3B, 7B, 72B) #724

Open

ThomasBaruzier opened this issue Jan 27, 2025 · 1 comment

@ThomasBaruzier
Problem

Hello!

Qwen dropped a new SOTA vision model today, and the 3B variant is on par with Qwen 2 VL 7B, which is pretty impressive!

If you want to give it a shot, here are the changes from Qwen 2 VL:


Model Architecture Updates:

  • Dynamic Resolution and Frame Rate Training for Video Understanding:

We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.

  • Streamlined and Efficient Vision Encoder

We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
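
To make the first bullet a bit more concrete, here is a minimal sketch of what time-aligned temporal position IDs could look like. This is my own illustration, not the actual Qwen2.5-VL code; the function name and the `tokens_per_second` granularity are assumptions. The point is that a frame's temporal ID follows its absolute timestamp rather than its index, so clips sampled at different FPS still advance their IDs at the same rate per second of video.

```python
# Illustration only -- not the actual Qwen2.5-VL implementation.
# Temporal mRoPE IDs aligned to absolute time: a frame's ID is derived from
# its timestamp, not from its index in the sampled sequence.

def temporal_position_ids(num_frames: int, fps: float, tokens_per_second: float = 2.0):
    """Hypothetical mapping from sampled frames to time-aligned position IDs.

    num_frames        -- frames sampled from the clip
    fps               -- the (dynamic) sampling rate used for this clip
    tokens_per_second -- assumed granularity: ID steps per second of video
    """
    ids = []
    for frame_idx in range(num_frames):
        timestamp_s = frame_idx / fps                     # absolute time of this frame
        ids.append(int(timestamp_s * tokens_per_second))  # time-aligned ID
    return ids

# Two clips sampled at different FPS: IDs still advance at the same rate per
# second of video, which lets the model infer speed and locate moments.
print(temporal_position_ids(num_frames=8, fps=1.0))  # [0, 2, 4, 6, 8, 10, 12, 14]
print(temporal_position_ids(num_frames=8, fps=4.0))  # [0, 0, 1, 1, 2, 2, 3, 3]
```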

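And for the second bullet, below is a rough, self-contained PyTorch sketch of a pre-norm ViT block that combines windowed self-attention with RMSNorm and a SwiGLU MLP. All dimensions, the window size, and the class names are assumptions for illustration; the real Qwen2.5-VL vision encoder differs in the details.

```python
# Illustration only -- a generic pre-norm ViT block with windowed attention,
# RMSNorm, and a SwiGLU MLP, i.e. the ingredients the second bullet describes.
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square instead of mean/variance (no bias).
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU-gated linear unit, matching the Qwen2.5 LLM MLP style.
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))


class WindowedViTBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, window: int = 4):
        super().__init__()
        self.window = window
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.mlp = SwiGLU(dim, hidden=dim * 4)

    def forward(self, x):
        # x: (batch, H, W, dim) patch grid; H and W divisible by the window size.
        b, h, w, c = x.shape
        s = self.window
        shortcut = x

        # Partition the grid into non-overlapping s x s windows so attention
        # cost scales with the window size, not with the full sequence length.
        xw = x.view(b, h // s, s, w // s, s, c).permute(0, 1, 3, 2, 4, 5)
        xw = xw.reshape(b * (h // s) * (w // s), s * s, c)

        xw = self.norm1(xw)
        attn_out, _ = self.attn(xw, xw, xw, need_weights=False)

        # Undo the window partition and add the residual.
        attn_out = attn_out.view(b, h // s, w // s, s, s, c)
        attn_out = attn_out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)
        x = shortcut + attn_out

        # Pre-norm SwiGLU MLP with a second residual connection.
        return x + self.mlp(self.norm2(x))


if __name__ == "__main__":
    block = WindowedViTBlock()
    patches = torch.randn(2, 16, 16, 256)  # toy 16x16 patch grid
    print(block(patches).shape)             # torch.Size([2, 16, 16, 256])
```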

From the model card:


Evaluation

Image benchmark

| Benchmarks | GPT4o | Claude3.5 Sonnet | Gemini-2-flash | InternVL2.5-78B | Qwen2-VL-72B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|
| MMMU_val | 70.3 | 70.4 | 70.7 | 70.1 | 64.5 | 70.2 |
| MMMU_Pro | 54.5 | 54.7 | 57.0 | 48.6 | 46.2 | 51.1 |
| MathVista_MINI | 63.8 | 65.4 | 73.1 | 76.6 | 70.5 | 74.8 |
| MathVision_FULL | 30.4 | 38.3 | 41.3 | 32.2 | 25.9 | 38.1 |
| Hallusion Bench | 55.0 | 55.16 | | 57.4 | 58.1 | 55.16 |
| MMBench_DEV_EN_V11 | 82.1 | 83.4 | 83.0 | 88.5 | 86.6 | 88 |
| AI2D_TEST | 84.6 | 81.2 | | 89.1 | 88.1 | 88.4 |
| ChartQA_TEST | 86.7 | 90.8 | 85.2 | 88.3 | 88.3 | 89.5 |
| DocVQA_VAL | 91.1 | 95.2 | 92.1 | 96.5 | 96.1 | 96.4 |
| MMStar | 64.7 | 65.1 | 69.4 | 69.5 | 68.3 | 70.8 |
| MMVet_turbo | 69.1 | 70.1 | | 72.3 | 74.0 | 76.19 |
| OCRBench | 736 | 788 | | 854 | 877 | 885 |
| OCRBench-V2 (en/zh) | 46.5/32.3 | 45.2/39.6 | 51.9/43.1 | 45/46.2 | 47.8/46.1 | 61.5/63.7 |
| CC-OCR | 66.6 | 62.7 | 73.0 | 64.7 | 68.7 | 79.8 |

Video benchmark

| Benchmarks | GPT4o | Gemini-1.5-Pro | InternVL2.5-78B | Qwen2VL-72B | Qwen2.5VL-72B |
|---|---|---|---|---|---|
| VideoMME w/o sub. | 71.9 | 75.0 | 72.1 | 71.2 | 73.3 |
| VideoMME w sub. | 77.2 | 81.3 | 74.0 | 77.8 | 79.1 |
| MVBench | 64.6 | 60.5 | 76.4 | 73.6 | 70.4 |
| MMBench-Video | 1.63 | 1.30 | 1.97 | 1.70 | 2.02 |
| LVBench | 30.8 | 33.1 | - | 41.3 | 47.3 |
| EgoSchema | 72.2 | 71.2 | - | 77.9 | 76.2 |
| PerceptionTest_test | - | - | - | 68.0 | 73.2 |
| MLVU_M-Avg_dev | 64.6 | - | 75.7 | | 74.6 |
| TempCompass_overall | 73.8 | - | - | | 74.8 |

Agent benchmark

| Benchmarks | GPT4o | Gemini 2.0 | Claude | Aguvis-72B | Qwen2VL-72B | Qwen2.5VL-72B |
|---|---|---|---|---|---|---|
| ScreenSpot | 18.1 | 84.0 | 83.0 | | | 87.1 |
| ScreenSpot Pro | | | 17.1 | | 1.6 | 43.6 |
| AITZ_EM | 35.3 | | | | 72.8 | 83.2 |
| Android Control High_EM | | | | 66.4 | 59.1 | 67.36 |
| Android Control Low_EM | | | | 84.4 | 59.2 | 93.7 |
| AndroidWorld_SR | 34.5% (SoM) | | 27.9% | 26.1% | | 35% |
| MobileMiniWob++_SR | | | | 66% | | 68% |
| OSWorld | | | 14.90 | 10.26 | | 8.83 |

Solution

Alternatives

No response

Explanation

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
@Originalimoc

I'd like to add a note: previously, quantizing Qwen2-VL-72B to 4.5bpw resulted in a drastic performance loss, to the point of being much worse than even the 7B at full precision. This time, maybe the quantizer could measure that loss and try to preserve more of the vision encoder's performance(?)
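
For what it's worth, one generic way to do that (shown here with the Hugging Face transformers + bitsandbytes stack, not with this project's quantizer, so take it as a sketch of the idea rather than a recipe for this repo) is to exclude the vision tower from low-bit quantization entirely. The `"visual"` module name matches the Qwen2-VL implementation in transformers, but double-check it for whichever loader you use:

```python
# Sketch: quantize the language model to 4-bit while keeping the vision
# encoder (and the output head) in full precision.
import torch
from transformers import Qwen2VLForConditionalGeneration, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Modules listed here are skipped during quantization; recent transformers
    # versions honor this for 4-bit loading as well as 8-bit.
    llm_int8_skip_modules=["visual", "lm_head"],
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The same idea would apply to bpw-based quantizers: hold the vision encoder at (or near) full precision and spend the low-bit budget on the language model.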
