
Decode sampled tuple frames only #12

Open

AbdelStark wants to merge 1 commit into Netflix:main from AbdelStark:selective-tuple-video-decode

Conversation

@AbdelStark commented Apr 8, 2026

What changed

This PR replaces eager full-video decoding in the video_mask_tuple training path with selective frame decoding.

  • add a shared loader for tuple-backed samples in videox_fun/utils/video_tuple_loader.py
  • compute batch_index first, then decode only the requested frames for rgb_full.mp4, rgb_removed.mp4, mask.mp4, and optional depth_removed.mp4
  • apply the same path to both dataset_image_video.py and dataset_image_video_warped.py
  • preserve the existing trimask / quadmask quantization, depth handling, and PNG-directory fallback behavior
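The core idea of the change can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the helper names (`sample_frame_indices`, `decode_selected`) are made up, and the reader object stands in for whatever random-access decoder the loader uses (e.g. a `decord.VideoReader`-style `get_batch`, or `cv2.VideoCapture` seeks). The point is the ordering: compute `batch_index` first, then decode only those frames.

```python
import random


def sample_frame_indices(total_frames, clip_length, stride=1, rng=None):
    """Pick the frame indices for one training clip *before* decoding.

    Mirrors the PR's approach: the loader computes batch_index up front
    so it can decode only those frames instead of the whole video.
    (Illustrative helper; not the loader's actual API.)
    """
    rng = rng or random
    span = (clip_length - 1) * stride + 1
    if span > total_frames:
        raise ValueError("clip does not fit in the video")
    start = rng.randrange(total_frames - span + 1)
    return [start + i * stride for i in range(clip_length)]


def decode_selected(reader, indices):
    """Decode only the requested frames via random access.

    `reader` is any object with index-based frame access; a real
    implementation would wrap a video decoder that supports seeking.
    """
    return [reader[i] for i in indices]
```

The eager version would decode all `total_frames` frames and then slice; here the decoder touches only `clip_length` frames, which is where the CPU, RAM, and I/O savings on long tuple videos come from.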

Why

The training datasets were decoding whole tuple videos and only then subselecting the clip used for the batch. On long sequences that wastes CPU, RAM, and disk I/O on frames the model never sees.

Selective decode keeps the released VOID data path the same from the model perspective, but removes avoidable host-side work from the loader.

Impact

  • lower host memory pressure during training
  • less CPU and disk work per sampled batch
  • better headroom for dataloader parallelism on long tuple videos

Validation

  • python3 -m py_compile videox_fun/utils/video_tuple_loader.py videox_fun/data/dataset_image_video.py videox_fun/data/dataset_image_video_warped.py
  • exercised the PNG-directory fallback with a stubbed smoke test to confirm sampled-frame loading and output shapes
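A stubbed smoke test for the PNG-directory fallback might look like the following. Everything here is hypothetical for illustration: the filename pattern, the `load_sampled_pngs` helper, and the default `load` callable (a stand-in for a real image decoder such as `PIL.Image.open`) are not the PR's actual names.

```python
import os
import tempfile


def load_sampled_pngs(frame_dir, indices, load=lambda p: p):
    """Fallback path: treat a directory of numbered PNGs as a video
    and load only the sampled frames. `load` defaults to returning
    the path so this sketch stays dependency-free."""
    files = sorted(f for f in os.listdir(frame_dir) if f.endswith(".png"))
    return [load(os.path.join(frame_dir, files[i])) for i in indices]


# Stubbed smoke test: ten empty "frames", sample three of them.
with tempfile.TemporaryDirectory() as d:
    for i in range(10):
        open(os.path.join(d, f"frame_{i:05d}.png"), "wb").close()
    picked = load_sampled_pngs(d, [0, 4, 9])
    assert len(picked) == 3
    assert picked[1].endswith("frame_00004.png")
```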

I did not run a training job or inference in this environment.

@AbdelStark AbdelStark marked this pull request as ready for review April 8, 2026 13:24
@JVSCHANDRADITHYA

Selective decoding makes a lot of sense for longer sequences.

But edge-case issues, like temporal coherence or alignment with masks, might only show up during training or inference if something is broken.

@AbdelStark
Author

> selective decoding makes a lot of sense for longer sequences.
>
> But edge case issues like temporal coherence, or alignment with masks, might only appear during inference/training, if it's broken.

Ok, no problem, I understand. Thanks for your time reviewing.

