
Conversation

@ruili33 ruili33 commented Jan 7, 2026

Description

This PR adds Vision-Language Model (VLM) training support to Levanter, with a focus on the LLaVA OneVision architecture.

Key Changes

New Features

SigLIP & Siglip2 Vision Encoder (models/siglip.py & models/siglip2.py)

  • Full implementation of the SigLIP (Sigmoid Loss for Language Image Pre-Training) and Siglip2 vision encoders

LLaVA OneVision Model (models/llava_onevision.py)

  • Complete multimodal model combining SigLIP/Siglip2 vision encoders with Qwen language models
  • Support for loading HuggingFace pretrained weights
  • Inference engine integration with KV cache support

Image Data Pipeline (data/image.py)

  • Image preprocessing pipeline for VLM training
  • Support for multiple data sources: URLs, HuggingFace datasets, parquet files
  • Conversation-format data handling with interleaved images and text
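
For illustration, a conversation-format record with an interleaved image might look like the following (field names are illustrative assumptions, not necessarily the exact schema the pipeline expects):

```python
# Illustrative conversation-format record (field names are assumptions):
example_record = {
    "images": ["https://example.com/cat.jpg"],
    "conversations": [
        {"role": "user", "content": "<image>\nWhat is in this picture?"},
        {"role": "assistant", "content": "A cat sitting on a windowsill."},
    ],
}
```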

VLM Training Infrastructure

  • train_vlm.py: End-to-end VLM training main script
  • launch_vlm_training.py: Launch script with TPU optimizations
  • ImageDataLoader in data/loader.py: Specialized data loader for variable-length image patches with proper batching and padding
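
To make the variable-length batching concrete, here is a minimal sketch of the padding problem ImageDataLoader solves (toy code, not the loader's actual implementation):

```python
import numpy as np

def pad_patch_batch(patch_seqs, max_patches):
    # Pad variable-length patch sequences to a fixed length and return a
    # boolean mask marking real (non-padding) patches.
    dim = patch_seqs[0].shape[-1]
    batch = np.zeros((len(patch_seqs), max_patches, dim), dtype=patch_seqs[0].dtype)
    mask = np.zeros((len(patch_seqs), max_patches), dtype=bool)
    for i, seq in enumerate(patch_seqs):
        batch[i, : len(seq)] = seq
        mask[i, : len(seq)] = True
    return batch, mask

batch, mask = pad_patch_batch([np.ones((5, 8)), np.ones((3, 8))], max_patches=6)
# batch: (2, 6, 8); mask rows have 5 and 3 True entries respectively
```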

Data Sources (data/sharded_datasource.py)

  • ImageTextUrlDataSource: Dataset for image-text pairs from JSON/JSONL/Parquet
  • ConversationUrlDataSource: Dataset for conversation-format VLM training data

Improvements

Splash Attention Explicit Mask Support (layers/attention.py)

  • Implemented explicit mask support for TPU Splash Attention
  • Converts NamedArray explicit masks to NumpyMask for Splash Attention compatibility
  • Proper error handling for dynamic masks during JIT tracing
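
A hedged sketch of that conversion, assuming the explicit mask is static (a concrete array rather than a traced value); the module path is the one found in recent JAX releases and may move:

```python
import numpy as np
from jax.experimental.pallas.ops.tpu.splash_attention import splash_attention_mask as mask_lib

def to_splash_mask(explicit_mask: np.ndarray) -> mask_lib.NumpyMask:
    # Splash Attention expects a numpy bool mask of shape [q_len, kv_len],
    # where True marks positions that may attend.
    return mask_lib.NumpyMask(np.asarray(explicit_mask, dtype=bool))
```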

Qwen Model (models/qwen.py)

  • Added decode() method to QwenDecoderLayer for paged decoding with KV cache

HuggingFace Checkpoint Compatibility (compat/hf_checkpoints.py)

  • Extended vocab_size lookup to support multimodal models (e.g., LlavaOnevision with nested text_config)
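
A hypothetical helper illustrating the extended lookup (not the exact hf_checkpoints.py code):

```python
def lookup_vocab_size(hf_config) -> int:
    # Plain LMs expose vocab_size at the top level; multimodal configs like
    # LlavaOnevision nest it under text_config.
    vocab = getattr(hf_config, "vocab_size", None)
    if vocab is not None:
        return vocab
    return hf_config.text_config.vocab_size
```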

Cache Improvements (store/cache.py)

  • Added progress bar with total row count for shard building
  • Fixed bug in _extend_cache_metadata_with_other: now correctly slices shape data to actual row count instead of copying entire pre-allocated shapes store
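
An illustrative sketch of the slicing fix (hypothetical names, not the actual cache.py code):

```python
import numpy as np

def extend_cache_shapes(dst: list, src_shapes: np.ndarray, src_num_rows: int) -> None:
    # Buggy version copied the entire pre-allocated shapes store:
    #   dst.extend(src_shapes.tolist())
    # Fixed version slices down to the rows actually written:
    dst.extend(src_shapes[:src_num_rows].tolist())
```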

@ruili33 ruili33 requested review from Helw150 and dlwh January 7, 2026 23:38

@dlwh dlwh left a comment


haven't finished reviewing model and tests yet


```python
# deshard. We could be smarter here and use a process mesh or host offloading, but this is simpler for now
state_dict = jax.lax.with_sharding_constraint(state_dict, PartitionSpec())
mesh = get_concrete_mesh()
```

dlwh:
we don't want the concrete mesh inside jit in general since it breaks compilation caching. can we do abstract?

dlwh:
also why is this necessary?

ruili33:
It's not strictly necessary. I added it because saving checkpoints outside of a mesh context would throw errors; it needs a non-empty mesh to work.
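
One possible shape of the workaround being discussed, as a hedged sketch (illustrative only, not the PR's actual code): enter a trivial fallback mesh before desharding, so that saving outside any mesh context doesn't error.

```python
import jax
from jax.sharding import Mesh, PartitionSpec

def deshard_for_save(state_dict):
    # Fallback 1-D mesh over all devices, in case no mesh is active.
    mesh = Mesh(jax.devices(), axis_names=("data",))
    with mesh:
        # Empty PartitionSpec -> fully replicated, safe to write to disk.
        return jax.lax.with_sharding_constraint(state_dict, PartitionSpec())
```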

```python
    return LlavaOnevisionModel.init(Vocab, config.model, key=model_key)

# For freezing, we use is_trainable=True and handle gradient zeroing separately
# This avoids haliax partitioning issues with non-trivial is_trainable filters
```

dlwh:
wait what's wrong

ruili33:
When passing a non-trivial is_trainable filter (e.g., {"vision_tower": False, "projector": True}) to the trainer, I ran into issues with Haliax's partitioning/sharding logic - it had trouble computing consistent axis mappings for the non-uniform set of trainable parameters. Using is_trainable=True and applying jax.lax.stop_gradient() in the loss function achieves the same freezing behavior while keeping the model structure uniform from Haliax's perspective.
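
A minimal, self-contained sketch of that freezing approach (toy parameter pytree; `vision_tower` and `projector` stand in for the real modules):

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x):
    # Keep the pytree uniform for the partitioner, but cut gradients to the
    # frozen subtree with stop_gradient.
    vt = jax.lax.stop_gradient(params["vision_tower"])
    return jnp.sum((x @ vt) @ params["projector"])

params = {"vision_tower": jnp.ones((4, 4)), "projector": jnp.ones((4, 2))}
grads = jax.grad(loss_fn)(params, jnp.ones((3, 4)))
# grads["vision_tower"] is all zeros; grads["projector"] is not.
```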

```python
from transformers import LlavaOnevisionConfig as HfLlavaOnevisionConfig  # noqa: E402


@LmConfig.register_subclass("llava_onevision")
```

dlwh:
should this be an LmConfig at all

ruili33:
Adding VlmConfig.

```python
num_unpadded_tokens = unpad_indices.axis_size("num_image_tokens")

# Gather features in HF's unpadded order
image_features_reordered = self._batch_gather(image_features_flat.array, unpad_indices.array)
```

dlwh:
pretty sure you shouldn't need this. haliax ought to handle this case i think with its indexing though maybe i don't understand it

ruili33:
Why unpad_indices: HuggingFace's LLaVA OneVision applies spatial unpadding based on aspect ratio after vision encoding (landscape images remove top/bottom padding, portraits remove left/right). Since Levanter uses fixed-size tensors, we precompute unpad_indices to map our padded features back to HF's spatial order.

Why _batch_gather: We need per-batch dynamic indexing - each image has different aspect ratio → different indices. Haliax indexing works for uniform operations across batches, but here each batch element needs its own index set. vmap(lambda arr, idx: arr[idx]) is the cleanest way to express this in JAX.
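
For concreteness, a sketch of that per-batch gather with assumed shapes (not the exact `_batch_gather` from the PR):

```python
import jax
import jax.numpy as jnp

def batch_gather(features, indices):
    # features: [batch, num_patches, dim]; indices: [batch, num_image_tokens].
    # Each batch element gets its own index set, so vmap a simple take.
    return jax.vmap(lambda arr, idx: arr[idx])(features, indices)

feats = jnp.arange(2 * 16 * 8).reshape(2, 16, 8)
idx = jnp.tile(jnp.arange(4), (2, 1))
out = batch_gather(feats, idx)  # shape (2, 4, 8)
```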


@dlwh dlwh left a comment


Can we reduce the tests by a lot?

  • don't need tests that are basically reimplementations of the class but as asserts
  • asserting shapes isn't useful if that's all we're doing
  • don't print so much in tests
  • tolerances should be 1e-4 for single layer, 1e-3 for multi layer unless we're pretty sure they're right

let's not set env vars at import time. most of the jax config updates shouldn't be necessary

@ruili33 ruili33 commented Jan 11, 2026

> Can we reduce the tests by a lot?
>
>   • don't need tests that are basically reimplementations of the class but as asserts
>   • asserting shapes isn't useful if that's all we're doing
>   • don't print so much in tests
>   • tolerances should be 1e-4 for single layer, 1e-3 for multi layer unless we're pretty sure they're right
>
> let's not set env vars at import time. most of the jax config updates shouldn't be necessary

I'm working on reducing the tests. As for the JAX configs: in my experience, if we don't force JAX to compute in float32, the results differ a lot from the HF ones.

Also, for SigLIP the mean difference goes below 1e-3, but the max difference only goes below 1e-2. Is this expected? I've double-checked many times and the implementation should be correct.
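
One way to scope the float32 forcing to individual tests rather than setting env vars or global config at import time, as a sketch (toy matmul standing in for a model forward pass):

```python
import jax
import jax.numpy as jnp

def test_matmul_in_float32():
    x = jnp.ones((4, 8))
    w = jnp.ones((8, 8))
    # Matmuls inside this block run at full float32 precision; nothing
    # leaks into other tests or import time.
    with jax.default_matmul_precision("float32"):
        y = x @ w
    assert y.shape == (4, 8)
```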

@ruili33 ruili33 requested a review from dlwh January 12, 2026 05:18