Llava video #97
There is already logic for loading parquet datasets here:
lmms-engine/src/lmms_engine/utils/data_utils.py
Lines 65 to 70 in 4a9a3c7
    if data_type == "arrow":
        dataset = load_from_disk(path)
    elif data_type == "parquet":
        dataset = Dataset.from_parquet(path)
    else:
        dataset = DataUtilities.maybe_load_json_or_jsonlines_or_csv(path, data_type)
I'm wondering whether changing the top-level multimodal dataset will break this logic, since you override self.load_from_hf. I think you can simply set dataset_format to hf_dataset instead of creating a new one.
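To make the concern concrete, here is a minimal sketch of a single dispatch on the data type, so the existing parquet branch stays reachable when a new dataset class is added. Everything in it is an assumption for illustration: load_arrow, load_parquet, and load_text_like are hypothetical stand-ins for load_from_disk, Dataset.from_parquet, and the JSON/CSV helper, and load_by_type is an invented name, not the function in data_utils.py.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


# Hypothetical stand-ins for the real loaders (load_from_disk, Dataset.from_parquet).
# They only tag the path so the dispatch can be observed without files on disk.
def load_arrow(path: str) -> List[Record]:
    return [{"format": "arrow", "path": path}]


def load_parquet(path: str) -> List[Record]:
    return [{"format": "parquet", "path": path}]


def load_text_like(path: str, data_type: str) -> List[Record]:
    return [{"format": data_type, "path": path}]


def load_by_type(path: str, data_type: str) -> List[Record]:
    """Dispatch on data_type, mirroring the if/elif/else quoted above."""
    loaders: Dict[str, Callable[[str], List[Record]]] = {
        "arrow": load_arrow,
        "parquet": load_parquet,
    }
    loader = loaders.get(data_type)
    if loader is not None:
        return loader(path)
    # Fallback branch for json / jsonlines / csv.
    return load_text_like(path, data_type)
```

A dataset subclass that reuses this single entry point, rather than overriding the loader, would keep the parquet path intact.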
Consider moving this logic into the LLaVA-Video dataset. Adding such a specific function to a high-level abstract dataset may not be a good choice. You can refer to the Qwen3-VL implementation, which also needs a custom video loading operation. Thanks!
lmms-engine/src/lmms_engine/datasets/iterable/qwen3_vl_iterable_dataset.py
Lines 54 to 99 in 4a9a3c7
    def load_videos(self, video_path: str, data_folder=None, fps: int = 1):
        assert (
            self.config.video_backend == "qwen_vl_utils"
        ), "Qwen3VLIterableDataset only supports qwen_vl_utils backend"
        frames, video_metadata, sample_fps = self.load_video_qwen_vl_utils(video_path, fps)
        return frames, video_metadata, sample_fps

    def load_video_qwen_vl_utils(
        self,
        video_path: str,
        fps: int,
    ) -> Tuple[np.ndarray, float]:
        """
        Load video using Qwen VL utils.
        Args:
            video_path: Path to video file
            fps: Target frames per second
        Returns:
            Tuple of (video frames, video metadata, sample fps)
        """
        video_dict = {
            "type": "video",
            "video": f"file://{video_path}",
            "min_frames": 1,
            "max_pixels": self.config.video_max_pixels,
            "max_frames": self.config.video_max_frames,
            "min_pixels": self.config.video_min_pixels,
        }
        if self.config.video_sampling_strategy == "frame_num":
            n_frames = self.config.frame_num
            video_dict["nframes"] = n_frames
            video_inputs, sample_fps = fetch_video(video_dict, return_video_sample_fps=True, return_video_metadata=True)
            frames, video_metadata = video_inputs
            frames = frames.numpy()
            return frames, video_metadata, sample_fps
        elif self.config.video_sampling_strategy == "fps":
            video_dict["fps"] = fps
            video_inputs, sample_fps = fetch_video(video_dict, return_video_sample_fps=True, return_video_metadata=True)
            frames, video_metadata = video_inputs
            frames = frames.numpy()
            return frames, video_metadata, sample_fps
        else:
            raise ValueError(f"Invalid video sampling strategy: {self.config.video_sampling_strategy}")
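Following the same pattern, a time-aware loader could live in the model-specific dataset rather than in the shared base class. The sketch below is only an illustration under assumptions: GenericVideoDataset, LlavaVideoDatasetSketch, and TimedFrames are hypothetical names, the method takes the frame count and frame rate as arguments instead of decoding a real file, and uniform sampling is assumed as the sampling strategy. Only the method name load_video_with_time comes from this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedFrames:
    frame_indices: List[int]   # which frames to decode
    frame_times: List[float]   # timestamp of each sampled frame, in seconds
    video_duration: float      # total length of the clip, in seconds


class GenericVideoDataset:
    """Hypothetical stand-in for the high-level dataset; no model-specific logic."""

    def uniform_indices(self, total_frames: int, num_frames: int) -> List[int]:
        # Evenly spaced indices from the first to the last frame.
        n = max(1, min(num_frames, total_frames))
        if n == 1:
            return [0]
        return [i * (total_frames - 1) // (n - 1) for i in range(n)]


class LlavaVideoDatasetSketch(GenericVideoDataset):
    """Hypothetical subclass that owns the time-aware loading step."""

    def load_video_with_time(self, total_frames: int, video_fps: float, num_frames: int) -> TimedFrames:
        if total_frames <= 0 or video_fps <= 0:
            raise ValueError("total_frames and video_fps must be positive")
        indices = self.uniform_indices(total_frames, num_frames)
        # Convert each frame index to a timestamp using the source frame rate.
        times = [idx / video_fps for idx in indices]
        return TimedFrames(indices, times, total_frames / video_fps)
```

With this split, the base class keeps only generic helpers, and other datasets are unaffected by LLaVA-Video's time metadata.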
Motivation
This PR adds comprehensive support for LLaVA-Video training to the LMMs-Engine framework. The main goals are:
This implementation is based on LLaVA-NeXT's video processing approach and is compatible with transformers v4.57.1.
Modifications
1. Video-Aware Model Forward Pass
New file: src/lmms_engine/models/llava_onevision/llava_video_forward.py (315 lines)
- LlavaOnevisionModel with slow-fast frame support
- faster_token parameter to mark frame types
- bilinear, average, and max pooling modes
- apply_2d_pool() reshapes features to 2D spatial layout and applies efficient token reduction
2. Model Monkey Patching Infrastructure
New file: src/lmms_engine/models/llava_onevision/monkey_patch.py (146 lines)
Liger Kernel Patch (@MONKEY_PATCHER.register("llava_onevision", "liger")):
Video Extension Patch (@MONKEY_PATCHER.register("llava_onevision", "video")):
- faster_token parameter
Configuration example:
3. Video Data Processor
New file: src/lmms_engine/datasets/processor/llava_video_processor.py (274 lines)
- <video> tokens to actual feature tokens
- (num_frames * pooled_height * pooled_width) + 1
- LlavaOnevisionProcessor
- pixel_values (images) and pixel_values_videos (videos)
- inject_time_instruction() method adds temporal context
4. Video Dataset Implementation
New file: src/lmms_engine/datasets/naive/llava_video_dataset.py (371 lines)
Multiple Data Format Support:
Flexible Video Loading:
Time Metadata Extraction:
Configurable Frame Sampling:
Robust Error Handling: Gracefully skips corrupted videos and tries the next sample
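The skip-and-retry behavior can be sketched roughly as follows. This is a minimal illustration, not the code in llava_video_dataset.py: SkipCorruptedDataset, load_fn, and fake_loader are hypothetical names, and it assumes that a failed load surfaces as OSError or ValueError.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger(__name__)


class SkipCorruptedDataset:
    """Hypothetical map-style dataset that falls through to the next sample on failure."""

    def __init__(self, video_paths: Sequence[str], load_fn: Callable[[str], List[int]]):
        self.video_paths = list(video_paths)
        self.load_fn = load_fn

    def __len__(self) -> int:
        return len(self.video_paths)

    def __getitem__(self, index: int) -> List[int]:
        n = len(self.video_paths)
        # Try the requested sample first, then each following one, wrapping around once.
        for offset in range(n):
            path = self.video_paths[(index + offset) % n]
            try:
                return self.load_fn(path)
            except (OSError, ValueError) as exc:
                logger.warning("Skipping unreadable video %s: %s", path, exc)
        raise RuntimeError("No loadable video found in the dataset")


def fake_loader(path: str) -> List[int]:
    # Stand-in for real decoding: fails on paths marked as corrupt.
    if "corrupt" in path:
        raise ValueError("cannot decode")
    return [len(path)]
```

Bounding the retries to one pass over the dataset avoids an infinite loop when every video is unreadable, at the cost of occasionally returning a sample other than the one requested.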
5. Video Loading Utilities
Modified: src/lmms_engine/datasets/multimodal_mixin.py (+81 lines)
- load_video_with_time() method to extract video frames with temporal metadata
6. Example Configuration
New file: examples/llava_video/llava_video_qwen2_7b.yaml (257 lines)
- llava_video type
Commit Message Convention
Please follow our standardized commit message format:
- [feat] - New features or functionality
- [fix] - Bug fixes
- [docs] - Documentation changes only
- [style] - Code style changes (formatting, missing semicolons, etc.)
- [refactor] - Code refactoring without changing functionality
- [perf] - Performance improvements
- [test] - Adding or updating tests
- [chore] - Maintenance tasks, dependency updates, etc.
- [ci] - CI/CD configuration changes
This PR:
[feat] add llava-video support with slow-fast frame processing
See CONTRIBUTING.md for more details.
CI/CD Checks
Your PR will automatically run the following checks:
- black (line-length=120) and import sorting with isort
- pre-commit run --all-files locally to verify before pushing
Checklist
- pre-commit run --all-files and ensure all checks pass
- black (line-length=120) and isort