@nssmd nssmd commented Nov 27, 2025

Motivation

This PR adds comprehensive support for LLaVA-Video training to the LMMs-Engine framework. The main goals are:

  1. Enable video understanding capabilities - Support training multimodal models on video data with temporal reasoning
  2. Implement slow-fast frame processing - Reduce video tokens by 50% while maintaining quality through adaptive resolution processing
  3. Add time instruction injection - Provide temporal awareness by injecting frame timestamps and video duration into prompts
  4. Ensure compatibility - Seamlessly integrate with existing LLaVA-OneVision models and training infrastructure

This implementation is based on LLaVA-NeXT's video processing approach and is compatible with transformers v4.57.1.

Modifications

1. Video-Aware Model Forward Pass

New file: src/lmms_engine/models/llava_onevision/llava_video_forward.py (315 lines)

  • Implemented custom forward pass for LlavaOnevisionModel with slow-fast frame support
  • Slow-Fast Frame Processing:
    • Slow frames (every Nth frame): High-resolution features with stride=2 pooling
    • Fast frames (intermediate): Low-resolution features with stride=4 pooling
    • Learnable faster_token parameter to mark frame types
  • Configurable Spatial Pooling: Supports bilinear, average, and max pooling modes
  • 2D Spatial Pooling Function: apply_2d_pool() reshapes features to 2D spatial layout and applies efficient token reduction
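The 2D pooling step above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual 315-line implementation: it assumes the vision features arrive as `[num_frames, num_patches, hidden]` with a square patch grid, and shows how stride-2 (slow) vs. stride-4 (fast) pooling with the three supported modes would reduce tokens per frame:

```python
import math

import torch
import torch.nn.functional as F


def apply_2d_pool(features: torch.Tensor, stride: int, mode: str = "bilinear") -> torch.Tensor:
    """Reshape flat patch features to a 2D grid and reduce tokens by `stride` per side.

    features: [num_frames, num_patches, hidden], num_patches a perfect square.
    """
    num_frames, num_patches, hidden = features.shape
    side = int(math.sqrt(num_patches))
    out_side = side // stride
    # [frames, patches, dim] -> [frames, dim, H, W] for the pooling ops
    grid = features.view(num_frames, side, side, hidden).permute(0, 3, 1, 2)
    if mode == "bilinear":
        grid = F.interpolate(grid, size=(out_side, out_side), mode="bilinear")
    elif mode == "average":
        grid = F.avg_pool2d(grid, kernel_size=stride)
    elif mode == "max":
        grid = F.max_pool2d(grid, kernel_size=stride)
    else:
        raise ValueError(f"unknown pooling mode: {mode}")
    # back to [frames, pooled_patches, dim]
    return grid.permute(0, 2, 3, 1).reshape(num_frames, out_side * out_side, hidden)
```

With a 27x27 patch grid, stride 2 keeps 13x13 = 169 tokens per slow frame while stride 4 keeps 6x6 = 36 tokens per fast frame, which is where the overall token reduction comes from.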

2. Model Monkey Patching Infrastructure

New file: src/lmms_engine/models/llava_onevision/monkey_patch.py (146 lines)

  • Liger Kernel Patch (@MONKEY_PATCHER.register("llava_onevision", "liger")):

    • Applies Liger kernel optimizations (fused linear cross-entropy, RMSPad)
    • ~15% training speedup with reduced memory overhead
  • Video Extension Patch (@MONKEY_PATCHER.register("llava_onevision", "video")):

    • Enables slow-fast frame processing
    • Initializes learnable faster_token parameter
    • Replaces model forward with video-aware version
    • Can be applied independently or combined with Liger patch

Configuration example:

```yaml
model_config:
  model_type: llava_onevision
  monkey_patch_kwargs:
    patch_type: ["liger", "video"]
    add_faster_video: true
    faster_token_stride: 10
    mm_spatial_pool_stride: 2
    mm_spatial_pool_mode: bilinear
```
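The decorator usage shown above (`@MONKEY_PATCHER.register("llava_onevision", "video")`) suggests a registry keyed by model type and patch type. A minimal sketch of how such a registry might work, with a hypothetical `apply` method and dummy patch function for illustration:

```python
class MonkeyPatcher:
    """Minimal registry sketch: patch functions keyed by (model_type, patch_type)."""

    def __init__(self):
        self._registry = {}

    def register(self, model_type: str, patch_type: str):
        def decorator(fn):
            self._registry[(model_type, patch_type)] = fn
            return fn
        return decorator

    def apply(self, model, model_type: str, patch_types, **kwargs):
        # Patches compose: each runs in order against the same model.
        for patch_type in patch_types:
            self._registry[(model_type, patch_type)](model, **kwargs)
        return model


MONKEY_PATCHER = MonkeyPatcher()


@MONKEY_PATCHER.register("llava_onevision", "video")
def apply_video_patch(model, add_faster_video=False, **kwargs):
    # Hypothetical stand-in: the real patch initializes faster_token and
    # swaps in the video-aware forward pass.
    model.add_faster_video = add_faster_video
```

Registering patches by name is what lets `patch_type: ["liger", "video"]` in the config apply them independently or in combination.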

3. Video Data Processor

New file: src/lmms_engine/datasets/processor/llava_video_processor.py (274 lines)

  • Video Token Expansion: Properly expands <video> tokens to actual feature tokens
    • Token count: (num_frames * pooled_height * pooled_width) + 1
    • Aligns with transformers LlavaOnevisionProcessor
  • Mixed Image/Video Processing: Handles both modalities in the same batch
    • Separate storage: pixel_values (images) and pixel_values_videos (videos)
  • Time Instruction Injection: inject_time_instruction() method adds temporal context
    The video lasts for 10.00 seconds, and 16 frames are uniformly sampled from it.
    These frames are located at 0.00s,0.67s,1.33s,2.00s,...
    Please answer the following questions related to this video.
    
  • Label Creation: Proper masking for training (user/system messages masked as -100, assistant responses kept)
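The token-count formula and the time-instruction template above can be made concrete with a short sketch. `video_token_count` is a hypothetical helper name (the formula itself is from this PR); `inject_time_instruction` mirrors the method named above, reconstructed from the sample output:

```python
def video_token_count(num_frames: int, patches_per_side: int, pool_stride: int) -> int:
    """Tokens a <video> placeholder expands to: pooled patches per frame, plus one extra token."""
    pooled = patches_per_side // pool_stride
    return num_frames * pooled * pooled + 1


def inject_time_instruction(prompt: str, video_time: float, frame_times) -> str:
    """Prepend temporal context (duration + frame timestamps) to the user prompt."""
    frame_str = ",".join(f"{t:.2f}s" for t in frame_times)
    instruction = (
        f"The video lasts for {video_time:.2f} seconds, and {len(frame_times)} frames "
        f"are uniformly sampled from it. These frames are located at {frame_str}. "
        f"Please answer the following questions related to this video."
    )
    return f"{instruction}\n{prompt}"
```

For example, 16 frames on a 27x27 patch grid with stride-2 pooling expand to 16 * 13 * 13 + 1 = 2705 tokens.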

4. Video Dataset Implementation

New file: src/lmms_engine/datasets/naive/llava_video_dataset.py (371 lines)

  • Multiple Data Format Support:

    • JSON format with messages
    • CSV format with video + prompt columns
    • HuggingFace dataset format
  • Flexible Video Loading:

    • Video files (MP4, AVI, etc.) via decord backend
    • Pre-extracted frames directory (ShareGPTVideo/LLaVA-Hound format)
    • Automatic format detection
  • Time Metadata Extraction:

    • Total video duration, frame timestamps, sampling FPS, number of frames
  • Configurable Frame Sampling:

    extra_kwargs:
      frames_upbound: 16        # Max frames per video
      force_sample: true        # Force uniform sampling
      add_time_instruction: true # Inject time context
  • Robust Error Handling: Gracefully skips corrupted videos and tries next sample
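The `frames_upbound` / `force_sample` behavior described above amounts to uniform index selection over the video. A sketch under those assumptions (`sample_frame_indices` is a hypothetical helper name, not from the PR):

```python
import numpy as np


def sample_frame_indices(total_frames: int, native_fps: float,
                         frames_upbound: int, force_sample: bool = True):
    """Uniformly pick up to `frames_upbound` frame indices; return indices, timestamps, duration."""
    if total_frames <= frames_upbound and not force_sample:
        indices = np.arange(total_frames)
    else:
        # np.linspace includes both endpoints, so the first and last frames are always kept.
        indices = np.linspace(0, total_frames - 1, num=frames_upbound).round().astype(int)
    video_time = total_frames / native_fps
    frame_times = (indices / native_fps).tolist()
    return indices, frame_times, video_time
```

The returned timestamps are what feed the time-instruction text (e.g. a 300-frame, 30 fps clip sampled to 16 frames yields a 10.00 s duration and timestamps from 0.00 s to 9.97 s).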

5. Video Loading Utilities

Modified: src/lmms_engine/datasets/multimodal_mixin.py (+81 lines)

  • Added load_video_with_time() method to extract video frames with temporal metadata
  • Returns both frames tensor and metadata dict (video_time, frame_time, num_frames, sample_fps)
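The metadata contract of `load_video_with_time()` can be illustrated with a small helper that assembles the dict from the sampled indices. The keys match those listed above; `build_time_metadata` itself is a hypothetical name for illustration:

```python
def build_time_metadata(total_frames: int, native_fps: float, sampled_indices) -> dict:
    """Assemble the temporal metadata dict returned alongside the frames tensor."""
    video_time = total_frames / native_fps
    return {
        "video_time": video_time,                               # total duration in seconds
        "frame_time": [i / native_fps for i in sampled_indices],  # timestamp per sampled frame
        "num_frames": len(sampled_indices),
        "sample_fps": len(sampled_indices) / video_time,        # effective sampling rate
    }
```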

6. Example Configuration

New file: examples/llava_video/llava_video_qwen2_7b.yaml (257 lines)

  • Complete training configuration demonstrating:
    • Dataset configuration with video parameters
    • Processor setup with llava_video type
    • Model configuration with video patches
    • FSDP2 training settings
    • Memory-efficient defaults (16 frames)

Commit Message Convention

Please follow our standardized commit message format:

  • [feat] - New features or functionality
  • [fix] - Bug fixes
  • [docs] - Documentation changes only
  • [style] - Code style changes (formatting, missing semicolons, etc.)
  • [refactor] - Code refactoring without changing functionality
  • [perf] - Performance improvements
  • [test] - Adding or updating tests
  • [chore] - Maintenance tasks, dependency updates, etc.
  • [ci] - CI/CD configuration changes

This PR:

  • [feat] add llava-video support with slow-fast frame processing

See CONTRIBUTING.md for more details.

CI/CD Checks

Your PR will automatically run the following checks:

  • Linting: Code formatting with black (line-length=120) and import sorting with isort
  • Run pre-commit run --all-files locally to verify before pushing

Checklist

  • Follow commit message convention (see above)
  • Run pre-commit run --all-files and ensure all checks pass
  • Format code with black (line-length=120) and isort
  • Add unit tests for new functionality
  • Update documentation as needed, including docstrings and example configuration
  • Ensure all CI/CD checks pass


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Collaborator


There is logic for loading parquet datasets here:

```python
if data_type == "arrow":
    dataset = load_from_disk(path)
elif data_type == "parquet":
    dataset = Dataset.from_parquet(path)
else:
    dataset = DataUtilities.maybe_load_json_or_jsonlines_or_csv(path, data_type)
```

Wondering whether changing the top-level multimodal dataset will break this, since you override `self.load_from_hf`. I think you can simply change the `dataset_format` to `hf_dataset` instead of creating a new one.

Collaborator


Consider moving this logic into the LLaVA-Video dataset. Adding a very specific function to a high-level abstract dataset may not be a good choice. You can refer to the Qwen3-VL implementation, which also needs a custom video loading operation. Thanks!

```python
def load_videos(self, video_path: str, data_folder=None, fps: int = 1):
    assert (
        self.config.video_backend == "qwen_vl_utils"
    ), "Qwen3VLIterableDataset only supports qwen_vl_utils backend"
    frames, video_metadata, sample_fps = self.load_video_qwen_vl_utils(video_path, fps)
    return frames, video_metadata, sample_fps

def load_video_qwen_vl_utils(
    self,
    video_path: str,
    fps: int,
) -> Tuple[np.ndarray, float]:
    """
    Load video using Qwen VL utils.

    Args:
        video_path: Path to video file
        fps: Target frames per second

    Returns:
        Tuple of (video frames, video metadata, sample fps)
    """
    video_dict = {
        "type": "video",
        "video": f"file://{video_path}",
        "min_frames": 1,
        "max_pixels": self.config.video_max_pixels,
        "max_frames": self.config.video_max_frames,
        "min_pixels": self.config.video_min_pixels,
    }
    if self.config.video_sampling_strategy == "frame_num":
        video_dict["nframes"] = self.config.frame_num
        video_inputs, sample_fps = fetch_video(video_dict, return_video_sample_fps=True, return_video_metadata=True)
        frames, video_metadata = video_inputs
        frames = frames.numpy()
        return frames, video_metadata, sample_fps
    elif self.config.video_sampling_strategy == "fps":
        video_dict["fps"] = fps
        video_inputs, sample_fps = fetch_video(video_dict, return_video_sample_fps=True, return_video_metadata=True)
        frames, video_metadata = video_inputs
        frames = frames.numpy()
        return frames, video_metadata, sample_fps
    else:
        raise ValueError(f"Invalid video sampling strategy: {self.config.video_sampling_strategy}")
```
