@nssmd nssmd commented Nov 27, 2025

Motivation

This PR adds comprehensive support for LLaVA-Video training to the LMMs-Engine framework. The main goals are:

  1. Enable video understanding capabilities - Support training multimodal models on video data with temporal reasoning
  2. Implement slow-fast frame processing - Reduce video tokens by 50% while maintaining quality through adaptive resolution processing
  3. Add time instruction injection - Provide temporal awareness by injecting frame timestamps and video duration into prompts
  4. Ensure compatibility - Seamlessly integrate with existing LLaVA-OneVision models and training infrastructure

This implementation is based on LLaVA-NeXT's video processing approach and is compatible with transformers v4.57.1.

Modifications

1. Video-Aware Model Forward Pass

New file: src/lmms_engine/models/llava_onevision/llava_video_forward.py (315 lines)

  • Implemented custom forward pass for LlavaOnevisionModel with slow-fast frame support
  • Slow-Fast Frame Processing:
    • Slow frames (every Nth frame): High-resolution features with stride=2 pooling
    • Fast frames (intermediate): Low-resolution features with stride=4 pooling
    • Learnable faster_token parameter to mark frame types
  • Configurable Spatial Pooling: Supports bilinear, average, and max pooling modes
  • 2D Spatial Pooling Function: apply_2d_pool() reshapes features to 2D spatial layout and applies efficient token reduction
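The 2D pooling step above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual 315-line implementation: it assumes the vision features arrive as `[num_frames, num_patches, hidden]` with a square patch grid, and shows how stride-2 (slow) vs. stride-4 (fast) pooling with the three supported modes would reduce tokens per frame:

```python
import math

import torch
import torch.nn.functional as F


def apply_2d_pool(features: torch.Tensor, stride: int, mode: str = "bilinear") -> torch.Tensor:
    """Reshape flat patch features to a 2D grid and reduce tokens by `stride` per side.

    features: [num_frames, num_patches, hidden], num_patches a perfect square.
    """
    num_frames, num_patches, hidden = features.shape
    side = int(math.sqrt(num_patches))
    out_side = side // stride
    # [frames, patches, dim] -> [frames, dim, H, W] for the pooling ops
    grid = features.view(num_frames, side, side, hidden).permute(0, 3, 1, 2)
    if mode == "bilinear":
        grid = F.interpolate(grid, size=(out_side, out_side), mode="bilinear")
    elif mode == "average":
        grid = F.avg_pool2d(grid, kernel_size=stride)
    elif mode == "max":
        grid = F.max_pool2d(grid, kernel_size=stride)
    else:
        raise ValueError(f"unknown pooling mode: {mode}")
    # back to [frames, pooled_patches, dim]
    return grid.permute(0, 2, 3, 1).reshape(num_frames, out_side * out_side, hidden)
```

With a 27x27 patch grid, stride 2 keeps 13x13 = 169 tokens per slow frame while stride 4 keeps 6x6 = 36 tokens per fast frame, which is where the overall token reduction comes from.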

2. Model Monkey Patching Infrastructure

New file: src/lmms_engine/models/llava_onevision/monkey_patch.py (146 lines)

  • Liger Kernel Patch (@MONKEY_PATCHER.register("llava_onevision", "liger")):

    • Applies Liger kernel optimizations (fused linear cross-entropy, RMSPad)
    • ~15% training speedup with reduced memory overhead
  • Video Extension Patch (@MONKEY_PATCHER.register("llava_onevision", "video")):

    • Enables slow-fast frame processing
    • Initializes learnable faster_token parameter
    • Replaces model forward with video-aware version
    • Can be applied independently or combined with Liger patch

Configuration example:

```yaml
model_config:
  model_type: llava_onevision
  monkey_patch_kwargs:
    patch_type: ["liger", "video"]
    add_faster_video: true
    faster_token_stride: 10
    mm_spatial_pool_stride: 2
    mm_spatial_pool_mode: bilinear
```
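The decorator usage shown above (`@MONKEY_PATCHER.register("llava_onevision", "video")`) suggests a registry keyed by model type and patch type. A minimal sketch of how such a registry might work, with a hypothetical `apply` method and dummy patch function for illustration:

```python
class MonkeyPatcher:
    """Minimal registry sketch: patch functions keyed by (model_type, patch_type)."""

    def __init__(self):
        self._registry = {}

    def register(self, model_type: str, patch_type: str):
        def decorator(fn):
            self._registry[(model_type, patch_type)] = fn
            return fn
        return decorator

    def apply(self, model, model_type: str, patch_types, **kwargs):
        # Patches compose: each runs in order against the same model.
        for patch_type in patch_types:
            self._registry[(model_type, patch_type)](model, **kwargs)
        return model


MONKEY_PATCHER = MonkeyPatcher()


@MONKEY_PATCHER.register("llava_onevision", "video")
def apply_video_patch(model, add_faster_video=False, **kwargs):
    # Hypothetical stand-in: the real patch initializes faster_token and
    # swaps in the video-aware forward pass.
    model.add_faster_video = add_faster_video
```

Registering patches by name is what lets `patch_type: ["liger", "video"]` in the config apply them independently or in combination.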

3. Video Data Processor

New file: src/lmms_engine/datasets/processor/llava_video_processor.py (274 lines)

  • Video Token Expansion: Properly expands <video> tokens to actual feature tokens
    • Token count: (num_frames * pooled_height * pooled_width) + 1
    • Aligns with transformers LlavaOnevisionProcessor
  • Mixed Image/Video Processing: Handles both modalities in the same batch
    • Separate storage: pixel_values (images) and pixel_values_videos (videos)
  • Time Instruction Injection: inject_time_instruction() method adds temporal context
    The video lasts for 10.00 seconds, and 16 frames are uniformly sampled from it.
    These frames are located at 0.00s,0.67s,1.33s,2.00s,...
    Please answer the following questions related to this video.
    
  • Label Creation: Proper masking for training (user/system messages masked as -100, assistant responses kept)
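The token-count formula and the time-instruction template above can be made concrete with a short sketch. `video_token_count` is a hypothetical helper name (the formula itself is from this PR); `inject_time_instruction` mirrors the method named above, reconstructed from the sample output:

```python
def video_token_count(num_frames: int, patches_per_side: int, pool_stride: int) -> int:
    """Tokens a <video> placeholder expands to: pooled patches per frame, plus one extra token."""
    pooled = patches_per_side // pool_stride
    return num_frames * pooled * pooled + 1


def inject_time_instruction(prompt: str, video_time: float, frame_times) -> str:
    """Prepend temporal context (duration + frame timestamps) to the user prompt."""
    frame_str = ",".join(f"{t:.2f}s" for t in frame_times)
    instruction = (
        f"The video lasts for {video_time:.2f} seconds, and {len(frame_times)} frames "
        f"are uniformly sampled from it. These frames are located at {frame_str}. "
        f"Please answer the following questions related to this video."
    )
    return f"{instruction}\n{prompt}"
```

For example, 16 frames on a 27x27 patch grid with stride-2 pooling expand to 16 * 13 * 13 + 1 = 2705 tokens.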

4. Video Dataset Implementation

New file: src/lmms_engine/datasets/naive/llava_video_dataset.py (371 lines)

  • Multiple Data Format Support:

    • JSON format with messages
    • CSV format with video + prompt columns
    • HuggingFace dataset format
  • Flexible Video Loading:

    • Video files (MP4, AVI, etc.) via decord backend
    • Pre-extracted frames directory (ShareGPTVideo/LLaVA-Hound format)
    • Automatic format detection
  • Time Metadata Extraction:

    • Total video duration, frame timestamps, sampling FPS, number of frames
  • Configurable Frame Sampling:

    extra_kwargs:
      frames_upbound: 16        # Max frames per video
      force_sample: true        # Force uniform sampling
      add_time_instruction: true # Inject time context
  • Robust Error Handling: Gracefully skips corrupted videos and tries next sample
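The `frames_upbound` / `force_sample` behavior described above amounts to uniform index selection over the video. A sketch under those assumptions (`sample_frame_indices` is a hypothetical helper name, not from the PR):

```python
import numpy as np


def sample_frame_indices(total_frames: int, native_fps: float,
                         frames_upbound: int, force_sample: bool = True):
    """Uniformly pick up to `frames_upbound` frame indices; return indices, timestamps, duration."""
    if total_frames <= frames_upbound and not force_sample:
        indices = np.arange(total_frames)
    else:
        # np.linspace includes both endpoints, so the first and last frames are always kept.
        indices = np.linspace(0, total_frames - 1, num=frames_upbound).round().astype(int)
    video_time = total_frames / native_fps
    frame_times = (indices / native_fps).tolist()
    return indices, frame_times, video_time
```

The returned timestamps are what feed the time-instruction text (e.g. a 300-frame, 30 fps clip sampled to 16 frames yields a 10.00 s duration and timestamps from 0.00 s to 9.97 s).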

5. Video Loading Utilities

Modified: src/lmms_engine/datasets/multimodal_mixin.py (+81 lines)

  • Added load_video_with_time() method to extract video frames with temporal metadata
  • Returns both frames tensor and metadata dict (video_time, frame_time, num_frames, sample_fps)
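The metadata contract of `load_video_with_time()` can be illustrated with a small helper that assembles the dict from the sampled indices. The keys match those listed above; `build_time_metadata` itself is a hypothetical name for illustration:

```python
def build_time_metadata(total_frames: int, native_fps: float, sampled_indices) -> dict:
    """Assemble the temporal metadata dict returned alongside the frames tensor."""
    video_time = total_frames / native_fps
    return {
        "video_time": video_time,                               # total duration in seconds
        "frame_time": [i / native_fps for i in sampled_indices],  # timestamp per sampled frame
        "num_frames": len(sampled_indices),
        "sample_fps": len(sampled_indices) / video_time,        # effective sampling rate
    }
```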

6. Example Configuration

New file: examples/llava_video/llava_video_qwen2_7b.yaml (257 lines)

  • Complete training configuration demonstrating:
    • Dataset configuration with video parameters
    • Processor setup with llava_video type
    • Model configuration with video patches
    • FSDP2 training settings
    • Memory-efficient defaults (16 frames)

Commit Message Convention

Please follow our standardized commit message format:

  • [feat] - New features or functionality
  • [fix] - Bug fixes
  • [docs] - Documentation changes only
  • [style] - Code style changes (formatting, missing semicolons, etc.)
  • [refactor] - Code refactoring without changing functionality
  • [perf] - Performance improvements
  • [test] - Adding or updating tests
  • [chore] - Maintenance tasks, dependency updates, etc.
  • [ci] - CI/CD configuration changes

This PR:

  • [feat] add llava-video support with slow-fast frame processing

See CONTRIBUTING.md for more details.

CI/CD Checks

Your PR will automatically run the following checks:

  • Linting: Code formatting with black (line-length=120) and import sorting with isort
  • Run pre-commit run --all-files locally to verify before pushing

Checklist

  • Follow commit message convention (see above)
  • Run pre-commit run --all-files and ensure all checks pass
  • Format code with black (line-length=120) and isort
  • Add unit tests for new functionality
  • Update documentation as needed, including docstrings and example configuration
  • Ensure all CI/CD checks pass


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Collaborator


There is logic for loading parquet datasets here:

```python
if data_type == "arrow":
    dataset = load_from_disk(path)
elif data_type == "parquet":
    dataset = Dataset.from_parquet(path)
else:
    dataset = DataUtilities.maybe_load_json_or_jsonlines_or_csv(path, data_type)
```

Wondering whether changing the top-level multimodal dataset will break this, since you override `self.load_from_hf`. I think you can simply change the `dataset_format` to `hf_dataset` instead of creating a new one.

Collaborator


Consider moving this logic into the LLaVA-Video dataset. Adding a very specific function to a high-level abstract dataset may not be a good choice. You can refer to the Qwen3-VL implementation, which also needs a custom video loading operation. Thanks!

```python
def load_videos(self, video_path: str, data_folder=None, fps: int = 1):
    assert (
        self.config.video_backend == "qwen_vl_utils"
    ), "Qwen3VLIterableDataset only supports qwen_vl_utils backend"
    frames, video_metadata, sample_fps = self.load_video_qwen_vl_utils(video_path, fps)
    return frames, video_metadata, sample_fps

def load_video_qwen_vl_utils(
    self,
    video_path: str,
    fps: int,
) -> Tuple[np.ndarray, float]:
    """
    Load video using Qwen VL utils.

    Args:
        video_path: Path to video file
        fps: Target frames per second

    Returns:
        Tuple of (video frames, video metadata, sample fps)
    """
    video_dict = {
        "type": "video",
        "video": f"file://{video_path}",
        "min_frames": 1,
        "max_pixels": self.config.video_max_pixels,
        "max_frames": self.config.video_max_frames,
        "min_pixels": self.config.video_min_pixels,
    }
    if self.config.video_sampling_strategy == "frame_num":
        video_dict["nframes"] = self.config.frame_num
        video_inputs, sample_fps = fetch_video(video_dict, return_video_sample_fps=True, return_video_metadata=True)
        frames, video_metadata = video_inputs
        frames = frames.numpy()
        return frames, video_metadata, sample_fps
    elif self.config.video_sampling_strategy == "fps":
        video_dict["fps"] = fps
        video_inputs, sample_fps = fetch_video(video_dict, return_video_sample_fps=True, return_video_metadata=True)
        frames, video_metadata = video_inputs
        frames = frames.numpy()
        return frames, video_metadata, sample_fps
    else:
        raise ValueError(f"Invalid video sampling strategy: {self.config.video_sampling_strategy}")
```
