Qwen dropped a new SOTA vision model today, and the 3B variant is on par with Qwen 2 VL 7B, which is pretty impressive!
If you want to give it a shot, here are the changes from Qwen 2 VL:
Model Architecture Updates:
Dynamic Resolution and Frame Rate Training for Video Understanding:
We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.
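The absolute-time alignment can be illustrated with a small sketch (the function name and the `tokens_per_second` value are mine, not from the model card): temporal position IDs are derived from each frame's timestamp rather than its index in the sampled sequence, so the same moment in a video gets the same ID regardless of the sampling rate.

```python
def temporal_ids(num_frames: int, fps: float, tokens_per_second: float = 2.0):
    """Map sampled frames to temporal mRoPE IDs via absolute timestamps.

    Hypothetical sketch: frame i sampled at `fps` occurs at t = i / fps
    seconds, and its temporal ID is proportional to that absolute time.
    """
    return [round((i / fps) * tokens_per_second) for i in range(num_frames)]

# The same 4-second clip sampled at 1 FPS vs. 2 FPS:
ids_1fps = temporal_ids(4, fps=1.0)   # [0, 2, 4, 6]
ids_2fps = temporal_ids(8, fps=2.0)   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Note how the frame at t = 2 s receives temporal ID 4 under both sampling rates, which is what lets the model learn speed and pinpoint specific moments independently of FPS.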
Streamlined and Efficient Vision Encoder:
We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
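For reference, here is a minimal NumPy sketch of the three pieces mentioned above (window partitioning for the ViT, plus the SwiGLU MLP and RMSNorm it now shares with the Qwen2.5 LLM). Shapes and names are illustrative, not the actual implementation:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale by root-mean-square; no mean-centering, no bias."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: SiLU-gated up-projection, then down-projection."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(g) = g * sigmoid(g)
    return (silu * (x @ w_up)) @ w_down

def window_partition(tokens, grid_h, grid_w, win):
    """Split an (H*W, C) patch grid into (num_windows, win*win, C) groups,
    so attention is computed within each window instead of globally."""
    c = tokens.shape[-1]
    x = tokens.reshape(grid_h, grid_w, c)
    x = x.reshape(grid_h // win, win, grid_w // win, win, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, c)

# Toy shapes: an 8x8 patch grid with 16 channels, split into 4x4 windows.
tokens = np.random.randn(64, 16)
windows = window_partition(tokens, 8, 8, 4)   # shape (4, 16, 16)
```

Window attention keeps cost linear in the number of windows rather than quadratic in the full token count, which is where the speedup comes from.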
From the model card:
Evaluation
Image benchmark
| Benchmarks | GPT4o | Claude3.5 Sonnet | Gemini-2-flash | InternVL2.5-78B | Qwen2-VL-72B | Qwen2.5-VL-72B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU_val | 70.3 | 70.4 | 70.7 | 70.1 | 64.5 | 70.2 |
| MMMU_Pro | 54.5 | 54.7 | 57.0 | 48.6 | 46.2 | 51.1 |
| MathVista_MINI | 63.8 | 65.4 | 73.1 | 76.6 | 70.5 | 74.8 |
| MathVision_FULL | 30.4 | 38.3 | 41.3 | 32.2 | 25.9 | 38.1 |
| Hallusion Bench | 55.0 | 55.16 | - | 57.4 | 58.1 | 55.16 |
| MMBench_DEV_EN_V11 | 82.1 | 83.4 | 83.0 | 88.5 | 86.6 | 88 |
| AI2D_TEST | 84.6 | 81.2 | - | 89.1 | 88.1 | 88.4 |
| ChartQA_TEST | 86.7 | 90.8 | 85.2 | 88.3 | 88.3 | 89.5 |
| DocVQA_VAL | 91.1 | 95.2 | 92.1 | 96.5 | 96.1 | 96.4 |
| MMStar | 64.7 | 65.1 | 69.4 | 69.5 | 68.3 | 70.8 |
| MMVet_turbo | 69.1 | 70.1 | - | 72.3 | 74.0 | 76.19 |
| OCRBench | 736 | 788 | - | 854 | 877 | 885 |
| OCRBench-V2 (en/zh) | 46.5/32.3 | 45.2/39.6 | 51.9/43.1 | 45/46.2 | 47.8/46.1 | 61.5/63.7 |
| CC-OCR | 66.6 | 62.7 | 73.0 | 64.7 | 68.7 | 79.8 |
Video benchmark
| Benchmarks | GPT4o | Gemini-1.5-Pro | InternVL2.5-78B | Qwen2VL-72B | Qwen2.5VL-72B |
| --- | --- | --- | --- | --- | --- |
| VideoMME w/o sub. | 71.9 | 75.0 | 72.1 | 71.2 | 73.3 |
| VideoMME w sub. | 77.2 | 81.3 | 74.0 | 77.8 | 79.1 |
| MVBench | 64.6 | 60.5 | 76.4 | 73.6 | 70.4 |
| MMBench-Video | 1.63 | 1.30 | 1.97 | 1.70 | 2.02 |
| LVBench | 30.8 | 33.1 | - | 41.3 | 47.3 |
| EgoSchema | 72.2 | 71.2 | - | 77.9 | 76.2 |
| PerceptionTest_test | - | - | - | 68.0 | 73.2 |
| MLVU_M-Avg_dev | 64.6 | - | 75.7 | - | 74.6 |
| TempCompass_overall | 73.8 | - | - | - | 74.8 |
Agent benchmark
| Benchmarks | GPT4o | Gemini 2.0 | Claude | Aguvis-72B | Qwen2VL-72B | Qwen2.5VL-72B |
| --- | --- | --- | --- | --- | --- | --- |
| ScreenSpot | 18.1 | - | - | 84.0 | 83.0 | 87.1 |
| ScreenSpot Pro | - | - | 17.1 | - | 1.6 | 43.6 |
| AITZ_EM | 35.3 | - | - | - | 72.8 | 83.2 |
| Android Control High_EM | - | - | - | 66.4 | 59.1 | 67.36 |
| Android Control Low_EM | - | - | - | 84.4 | 59.2 | 93.7 |
| AndroidWorld_SR | 34.5% (SoM) | - | 27.9% | 26.1% | - | 35% |
| MobileMiniWob++_SR | 66% | - | - | - | - | 68% |
| OSWorld | - | - | 14.90 | 10.26 | - | 8.83 |
Solution
Alternatives
No response
Explanation
Examples
No response
Additional context
No response
Acknowledgements
I have looked for similar requests before submitting this one.
I understand that the developers have lives and my issue will be answered when possible.
I understand the developers of this program are human, and I will make my requests politely.
I'd like to add a note: previously, quantizing Qwen2-VL-72B to 4.5 bpw caused a drastic performance loss, to the point of being much worse than even the 7B model at full precision. Perhaps this time the quantizer could measure that loss and try to preserve more of the vision encoder's performance?