Add support for 336px CLIP vision encoder #25

Merged
yxjiang merged 4 commits into main from add-336px-vit
Dec 4, 2025

@yxjiang yxjiang commented Dec 2, 2025

  • Change default vision encoder to clip-vit-large-patch14-336 (336px)
  • Add image_size property to CLIPVisionEncoder
  • Add num_visual_tokens property that calculates dynamically from image size
  • Update max_length defaults to 1024 for both Phase 1 and Phase 2
  • Add validation warnings for insufficient max_length
  • Update checkpoint filenames to include pixel resolution (e.g., checkpoint_phase1_fp16_336px.pt)
  • Update datasets to auto-calculate num_visual_tokens from image processor
  • Update training scripts to pass num_visual_tokens explicitly
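
As a rough illustration of the token arithmetic behind these changes (not the code in this PR), the sketch below derives the visual token count from the image size, checks it against `max_length`, and builds a resolution-tagged checkpoint name. The patch size of 14 comes from the `patch14` part of the encoder name; the helper names, the `include_cls` flag, and the `min_text_tokens` margin are assumptions made for the example, and whether the real `num_visual_tokens` property counts the CLS token is not stated in this PR.

```python
import warnings


def num_visual_tokens(image_size: int, patch_size: int = 14,
                      include_cls: bool = False) -> int:
    """Number of patch tokens a ViT produces for a square image.

    Hypothetical helper: the PR's property may or may not count the CLS token,
    so that choice is exposed as a flag here.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    tokens = patches_per_side ** 2
    return tokens + 1 if include_cls else tokens


def check_max_length(max_length: int, visual_tokens: int,
                     min_text_tokens: int = 32) -> bool:
    """Warn when max_length leaves too little room for text after visual tokens.

    min_text_tokens is an assumed margin, not a value taken from the PR.
    """
    if max_length < visual_tokens + min_text_tokens:
        warnings.warn(
            f"max_length={max_length} may be too small for "
            f"{visual_tokens} visual tokens plus text"
        )
        return False
    return True


def checkpoint_name(phase: int, precision: str, image_size: int) -> str:
    """Build a filename following the pattern shown in the PR description."""
    return f"checkpoint_phase{phase}_{precision}_{image_size}px.pt"


if __name__ == "__main__":
    n = num_visual_tokens(336)      # 24 patches per side at 336px
    print(n, check_max_length(1024, n))
    print(checkpoint_name(1, "fp16", 336))
```

Under these assumptions, moving from 224px to 336px raises the patch count from 256 to 576, which is why a `max_length` that was comfortable for the smaller encoder can leave little or no room for text and why the default is raised to 1024.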

@yxjiang yxjiang merged commit 29d888d into main Dec 4, 2025
1 check passed