Add drop_last option to DataLoader batching #3448
Conversation
Introduces a drop_last flag to DataLoaderBuilder and FixBatchStrategy, allowing incomplete batches to be optionally dropped during data loading. Updates the builder API and batch strategy logic to support this feature, improving flexibility for batch processing.
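A minimal sketch of the described behavior (this is a standalone illustration; the struct below is a simplified stand-in for burn's actual FixBatchStrategy, and the method names and structure are assumptions, not the real API):

```rust
/// Simplified stand-in for burn's FixBatchStrategy, illustrating the
/// drop_last semantics described in the PR (names and structure are
/// assumptions, not the actual burn implementation).
struct FixBatchStrategy<I> {
    batch_size: usize,
    drop_last: bool,
    items: Vec<I>,
}

impl<I> FixBatchStrategy<I> {
    fn new(batch_size: usize, drop_last: bool) -> Self {
        Self { batch_size, drop_last, items: Vec::new() }
    }

    /// Buffer one item from the dataset.
    fn add(&mut self, item: I) {
        self.items.push(item);
    }

    /// Return a full batch if one is ready. At the end of the data
    /// (`force == true`), return the remaining partial batch unless
    /// drop_last is set, in which case it is discarded.
    fn batch(&mut self, force: bool) -> Option<Vec<I>> {
        if self.items.len() >= self.batch_size {
            return Some(self.items.drain(..self.batch_size).collect());
        }
        if force && !self.items.is_empty() && !self.drop_last {
            return Some(self.items.drain(..).collect());
        }
        None
    }
}

/// Count how many batches a dataset of `n` items yields.
fn count_batches(n: usize, batch_size: usize, drop_last: bool) -> usize {
    let mut strategy = FixBatchStrategy::new(batch_size, drop_last);
    (0..n).for_each(|i| strategy.add(i));
    let mut count = 0;
    while strategy.batch(false).is_some() {
        count += 1;
    }
    if strategy.batch(true).is_some() {
        count += 1;
    }
    count
}

fn main() {
    // 10 items, batch size 4: two full batches plus a partial batch of 2.
    println!("drop_last=false -> {} batches", count_batches(10, 4, false)); // 3
    println!("drop_last=true  -> {} batches", count_batches(10, 4, true)); // 2
}
```

With 10 items and a batch size of 4, the loader yields 3 batches by default (the last holding only 2 items) and 2 batches when drop_last is enabled.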
@laggui kindly review
Will take a look today!
Any updates?
Sorry for the late review!
I think the solution for correctly splitting the batches in the multi-threaded data loader can be much simpler.
See my comments below 🙂
Thanks for applying the name change 🙂
I think the changes to the multithreaded dataloader .iter() are no longer required with the fixed batch split, and my previous comment on the batch strategy still applies.
Hey @laggui, sorry for the late reply! I was away from work for a bit; I've made the changes in strategy.rs. I think this should be resolved now.
You are right, it's no longer required with #3476.
Pull Request Template

Checklist

- cargo run-checks command has been executed.

Related Issues/PRs

Fixes issue: DataLoader yields as many iterations as num_workers instead of the correct batch count; no drop_last support (see #3316).

Changes

Problem:
The DataLoader previously yielded one batch per worker per epoch, regardless of batch size or dataset size, leading to incorrect iteration counts. There was also no way to drop incomplete batches, unlike PyTorch's DataLoader.

Solution:
- Added a drop_last flag to DataLoaderBuilder and FixBatchStrategy.
- When drop_last is true, incomplete batches are dropped.
- The number of batches per epoch no longer depends on num_workers.

Testing

- Ran cargo test --workspace --all-features to ensure all tests pass.
- Verified the iteration count matches ceil(dataset_size / batch_size) for various num_workers values.
- Verified the drop_last flag correctly drops incomplete batches when enabled.
- Verified num_workers only affects parallelism, not batch count.
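The iteration-count expectation from the testing notes above can be expressed as a small helper (the function name is illustrative, not part of the PR):

```rust
/// Expected number of batches per epoch, matching the testing notes above:
/// ceil(dataset_size / batch_size) normally, floor when drop_last is set.
/// (Helper name is illustrative, not part of the PR.)
fn expected_batches(dataset_size: usize, batch_size: usize, drop_last: bool) -> usize {
    if drop_last {
        dataset_size / batch_size
    } else {
        (dataset_size + batch_size - 1) / batch_size
    }
}

fn main() {
    assert_eq!(expected_batches(10, 4, false), 3);
    assert_eq!(expected_batches(10, 4, true), 2);
    // Exact multiple: nothing to drop, so both settings agree.
    assert_eq!(expected_batches(8, 4, true), 2);
    // num_workers is deliberately absent: it affects parallelism only.
    println!("ok");
}
```

Note that num_workers does not appear in the formula at all, which is exactly the invariant the last test case checks.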