Add support for 336px CLIP vision encoder by yxjiang · Pull Request #25 · small-thinking/vlm

yxjiang · 2025-12-02T07:13:07Z

- Change default vision encoder to clip-vit-large-patch14-336 (336px)
- Add image_size property to CLIPVisionEncoder
- Add num_visual_tokens property that calculates dynamically from image size
- Update max_length defaults to 1024 for both Phase 1 and Phase 2
- Add validation warnings for insufficient max_length
- Update checkpoint filenames to include pixel resolution (e.g., checkpoint_phase1_fp16_336px.pt)
- Update datasets to auto-calculate num_visual_tokens from image processor
- Update training scripts to pass num_visual_tokens explicitly
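
For reviewers, here is a rough sketch of the arithmetic behind the changes listed above. It is not the code from this PR. The helper names (`count_visual_tokens`, `check_max_length`, `checkpoint_name`), the `min_text_tokens` margin, and the choice to count patch tokens only (no CLS token) are assumptions made for illustration. The patch size of 14 comes from the encoder name, clip-vit-large-patch14-336.

```python
import warnings


def count_visual_tokens(image_size: int, patch_size: int = 14,
                        include_cls: bool = False) -> int:
    """Number of tokens a ViT-style encoder emits for a square image.

    Hypothetical helper: whether the repo counts the CLS token is an
    assumption, so it is exposed as a flag here.
    """
    if image_size % patch_size != 0:
        raise ValueError(
            f"image_size {image_size} is not divisible by patch_size {patch_size}"
        )
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + (1 if include_cls else 0)


def check_max_length(max_length: int, n_visual: int,
                     min_text_tokens: int = 32) -> bool:
    """Warn when max_length leaves too little room for text.

    min_text_tokens is an illustrative margin, not a value from the PR.
    Returns True when the budget is sufficient, False otherwise.
    """
    if max_length < n_visual + min_text_tokens:
        warnings.warn(
            f"max_length={max_length} leaves fewer than {min_text_tokens} "
            f"text tokens after {n_visual} visual tokens"
        )
        return False
    return True


def checkpoint_name(phase: int, precision: str, image_size: int) -> str:
    """Build a checkpoint filename that includes the pixel resolution."""
    return f"checkpoint_phase{phase}_{precision}_{image_size}px.pt"


if __name__ == "__main__":
    n_visual = count_visual_tokens(336)  # 336 / 14 = 24 patches per side
    print(n_visual, check_max_length(1024, n_visual))
    print(checkpoint_name(1, "fp16", 336))
```

Under these assumptions, a 336px input yields 24 × 24 = 576 patch tokens, compared with 16 × 16 = 256 at 224px. That is likely why max_length is raised to 1024: a smaller limit such as 512 could not hold the visual tokens at all.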


yxjiang added 4 commits December 1, 2025 22:55

update

7258a27

update

1067066

update

a7d630a

yxjiang merged commit 29d888d into main Dec 4, 2025
1 check passed

yxjiang merged 4 commits into main from add-336px-vit
