[REQUEST] Support new SOTA vision model: Qwen 2.5 VL (3B, 7B, 72B) #724

Open

ThomasBaruzier opened this issue Jan 27, 2025 · 1 comment

@ThomasBaruzier
Problem

Hello!

Qwen dropped a new SOTA vision model today, and the 3B variant is on par with Qwen 2 VL 7B, which is pretty impressive!

If you want to give it a shot, here are the changes from Qwen 2 VL:


Model Architecture Updates:

  • Dynamic Resolution and Frame Rate Training for Video Understanding:

We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.

  • Streamlined and Efficient Vision Encoder

We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
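
To make the first bullet a bit more concrete, here is a minimal sketch of what time-aligned temporal position IDs could look like. This is my own illustration, not the actual Qwen2.5-VL code; the function name and the `tokens_per_second` granularity are assumptions. The point is that a frame's temporal ID follows its absolute timestamp rather than its index, so clips sampled at different FPS still advance their IDs at the same rate per second of video.

```python
# Illustration only -- not the actual Qwen2.5-VL implementation.
# Temporal mRoPE IDs aligned to absolute time: a frame's ID is derived from
# its timestamp, not from its index in the sampled sequence.

def temporal_position_ids(num_frames: int, fps: float, tokens_per_second: float = 2.0):
    """Hypothetical mapping from sampled frames to time-aligned position IDs.

    num_frames        -- frames sampled from the clip
    fps               -- the (dynamic) sampling rate used for this clip
    tokens_per_second -- assumed granularity: ID steps per second of video
    """
    ids = []
    for frame_idx in range(num_frames):
        timestamp_s = frame_idx / fps                     # absolute time of this frame
        ids.append(int(timestamp_s * tokens_per_second))  # time-aligned ID
    return ids

# Two clips sampled at different FPS: IDs still advance at the same rate per
# second of video, which lets the model infer speed and locate moments.
print(temporal_position_ids(num_frames=8, fps=1.0))  # [0, 2, 4, 6, 8, 10, 12, 14]
print(temporal_position_ids(num_frames=8, fps=4.0))  # [0, 0, 1, 1, 2, 2, 3, 3]
```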

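And for the second bullet, below is a rough, self-contained PyTorch sketch of a pre-norm ViT block that combines windowed self-attention with RMSNorm and a SwiGLU MLP. All dimensions, the window size, and the class names are assumptions for illustration; the real Qwen2.5-VL vision encoder differs in the details.

```python
# Illustration only -- a generic pre-norm ViT block with windowed attention,
# RMSNorm, and a SwiGLU MLP, i.e. the ingredients the second bullet describes.
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square instead of mean/variance (no bias).
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU-gated linear unit, matching the Qwen2.5 LLM MLP style.
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))


class WindowedViTBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, window: int = 4):
        super().__init__()
        self.window = window
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.mlp = SwiGLU(dim, hidden=dim * 4)

    def forward(self, x):
        # x: (batch, H, W, dim) patch grid; H and W divisible by the window size.
        b, h, w, c = x.shape
        s = self.window
        shortcut = x

        # Partition the grid into non-overlapping s x s windows so attention
        # cost scales with the window size, not with the full sequence length.
        xw = x.view(b, h // s, s, w // s, s, c).permute(0, 1, 3, 2, 4, 5)
        xw = xw.reshape(b * (h // s) * (w // s), s * s, c)

        xw = self.norm1(xw)
        attn_out, _ = self.attn(xw, xw, xw, need_weights=False)

        # Undo the window partition and add the residual.
        attn_out = attn_out.view(b, h // s, w // s, s, s, c)
        attn_out = attn_out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)
        x = shortcut + attn_out

        # Pre-norm SwiGLU MLP with a second residual connection.
        return x + self.mlp(self.norm2(x))


if __name__ == "__main__":
    block = WindowedViTBlock()
    patches = torch.randn(2, 16, 16, 256)  # toy 16x16 patch grid
    print(block(patches).shape)             # torch.Size([2, 16, 16, 256])
```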

From the model card:


Evaluation

Image benchmark

| Benchmarks | GPT4o | Claude3.5 Sonnet | Gemini-2-flash | InternVL2.5-78B | Qwen2-VL-72B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|
| MMMU_val | 70.3 | 70.4 | 70.7 | 70.1 | 64.5 | 70.2 |
| MMMU_Pro | 54.5 | 54.7 | 57.0 | 48.6 | 46.2 | 51.1 |
| MathVista_MINI | 63.8 | 65.4 | 73.1 | 76.6 | 70.5 | 74.8 |
| MathVision_FULL | 30.4 | 38.3 | 41.3 | 32.2 | 25.9 | 38.1 |
| Hallusion Bench | 55.0 | 55.16 | | 57.4 | 58.1 | 55.16 |
| MMBench_DEV_EN_V11 | 82.1 | 83.4 | 83.0 | 88.5 | 86.6 | 88 |
| AI2D_TEST | 84.6 | 81.2 | | 89.1 | 88.1 | 88.4 |
| ChartQA_TEST | 86.7 | 90.8 | 85.2 | 88.3 | 88.3 | 89.5 |
| DocVQA_VAL | 91.1 | 95.2 | 92.1 | 96.5 | 96.1 | 96.4 |
| MMStar | 64.7 | 65.1 | 69.4 | 69.5 | 68.3 | 70.8 |
| MMVet_turbo | 69.1 | 70.1 | | 72.3 | 74.0 | 76.19 |
| OCRBench | 736 | 788 | | 854 | 877 | 885 |
| OCRBench-V2 (en/zh) | 46.5/32.3 | 45.2/39.6 | 51.9/43.1 | 45/46.2 | 47.8/46.1 | 61.5/63.7 |
| CC-OCR | 66.6 | 62.7 | 73.0 | 64.7 | 68.7 | 79.8 |

Video benchmark

| Benchmarks | GPT4o | Gemini-1.5-Pro | InternVL2.5-78B | Qwen2VL-72B | Qwen2.5VL-72B |
|---|---|---|---|---|---|
| VideoMME w/o sub. | 71.9 | 75.0 | 72.1 | 71.2 | 73.3 |
| VideoMME w sub. | 77.2 | 81.3 | 74.0 | 77.8 | 79.1 |
| MVBench | 64.6 | 60.5 | 76.4 | 73.6 | 70.4 |
| MMBench-Video | 1.63 | 1.30 | 1.97 | 1.70 | 2.02 |
| LVBench | 30.8 | 33.1 | - | 41.3 | 47.3 |
| EgoSchema | 72.2 | 71.2 | - | 77.9 | 76.2 |
| PerceptionTest_test | - | - | - | 68.0 | 73.2 |
| MLVU_M-Avg_dev | 64.6 | - | 75.7 | | 74.6 |
| TempCompass_overall | 73.8 | - | - | | 74.8 |

Agent benchmark

| Benchmarks | GPT4o | Gemini 2.0 | Claude | Aguvis-72B | Qwen2VL-72B | Qwen2.5VL-72B |
|---|---|---|---|---|---|---|
| ScreenSpot | 18.1 | 84.0 | 83.0 | | | 87.1 |
| ScreenSpot Pro | | | 17.1 | | 1.6 | 43.6 |
| AITZ_EM | 35.3 | | | | 72.8 | 83.2 |
| Android Control High_EM | | | | 66.4 | 59.1 | 67.36 |
| Android Control Low_EM | | | | 84.4 | 59.2 | 93.7 |
| AndroidWorld_SR | 34.5% (SoM) | | 27.9% | 26.1% | | 35% |
| MobileMiniWob++_SR | | | | 66% | | 68% |
| OSWorld | | | 14.90 | 10.26 | | 8.83 |

Solution

Alternatives

No response

Explanation

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
@Originalimoc

I'd like to add a note: previously, quantizing Qwen2-VL-72B to 4.5bpw resulted in a drastic performance loss, to the point of being much worse than even the 7B at full precision. This time, maybe the quantizer could measure that loss and try to preserve more of the vision encoder's performance(?)
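
For what it's worth, one generic way to do that (shown here with the Hugging Face transformers + bitsandbytes stack, not with this project's quantizer, so take it as a sketch of the idea rather than a recipe for this repo) is to exclude the vision tower from low-bit quantization entirely. The `"visual"` module name matches the Qwen2-VL implementation in transformers, but double-check it for whichever loader you use:

```python
# Sketch: quantize the language model to 4-bit while keeping the vision
# encoder (and the output head) in full precision.
import torch
from transformers import Qwen2VLForConditionalGeneration, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Modules listed here are skipped during quantization; recent transformers
    # versions honor this for 4-bit loading as well as 8-bit.
    llm_int8_skip_modules=["visual", "lm_head"],
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The same idea would apply to bpw-based quantizers: hold the vision encoder at (or near) full precision and spend the low-bit budget on the language model.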
