Conversation

hollermay

Introduces a drop_last flag to DataLoaderBuilder and FixBatchStrategy, allowing incomplete batches to be optionally dropped during data loading. Updates the builder API and batch strategy logic to support this feature, improving flexibility for batch processing.

Pull Request Template

Checklist

  • Confirmed that the cargo run-checks command has been executed.
  • Made sure the book is up to date with changes in this PR.

Related Issues/PRs

Fixes issue: DataLoader yields as many iterations as num_workers instead of correct batch count; no drop_last support (see #3316)

Changes

Problem:
The DataLoader previously yielded one batch per worker per epoch, regardless of batch size or dataset size, leading to incorrect iteration counts. There was also no way to drop incomplete batches, unlike PyTorch’s DataLoader.

Solution:

  • Introduced a drop_last flag to DataLoaderBuilder and FixBatchStrategy.
  • Updated the builder API to allow users to set drop_last.
  • Modified the batch strategy logic so that, when drop_last is true, incomplete batches are dropped (a rough sketch follows this list).
  • Refactored the multithreaded DataLoader to use a shared batch queue among workers, ensuring the number of iterations per epoch is determined by dataset size and batch size, not by num_workers.
  • This brings the DataLoader’s behavior in line with PyTorch and improves flexibility for batch processing.
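
To make the strategy logic concrete, here is a rough Rust sketch of a fixed-size batch strategy with drop_last. The struct fields, the add/batch methods, and the force flag are assumptions based on the description above, not the exact merged code:

```rust
/// Sketch of a fixed-size batch strategy with drop_last support.
/// Names and internals are illustrative, not the exact merged code.
struct FixBatchStrategy<I> {
    items: Vec<I>,
    batch_size: usize,
    drop_last: bool,
}

impl<I> FixBatchStrategy<I> {
    fn new(batch_size: usize, drop_last: bool) -> Self {
        Self {
            items: Vec::with_capacity(batch_size),
            batch_size,
            drop_last,
        }
    }

    /// Buffer one item from the dataset.
    fn add(&mut self, item: I) {
        self.items.push(item);
    }

    /// Return the next batch, if one is ready. `force` signals the end of
    /// the epoch: a trailing partial batch is flushed only when drop_last
    /// is disabled; otherwise it is silently discarded.
    fn batch(&mut self, force: bool) -> Option<Vec<I>> {
        if self.items.len() >= self.batch_size {
            return Some(self.items.drain(..self.batch_size).collect());
        }
        if force && !self.items.is_empty() && !self.drop_last {
            return Some(std::mem::take(&mut self.items));
        }
        None
    }
}
```

On the builder side, usage would presumably read along the lines of DataLoaderBuilder::new(batcher).batch_size(32).num_workers(4).drop_last(true).build(dataset), with drop_last defaulting to false so existing callers are unaffected.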

Testing

  • Ran cargo test --workspace --all-features to ensure all tests pass.
  • Manually verified that:
    • The number of iterations per epoch matches ceil(dataset_size / batch_size) for various num_workers values (a helper expressing this check is sketched after this list).
    • The drop_last flag correctly drops incomplete batches when enabled.
    • Changing num_workers only affects parallelism, not batch count.
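
For reference, the expected iteration count verified above can be computed directly. A minimal, hypothetical helper mirroring PyTorch's semantics (floor with drop_last, ceil without):

```rust
/// Expected batches per epoch: floor(dataset_size / batch_size) when
/// drop_last is set, ceil(dataset_size / batch_size) otherwise.
fn expected_batches(dataset_size: usize, batch_size: usize, drop_last: bool) -> usize {
    if drop_last {
        dataset_size / batch_size
    } else {
        dataset_size.div_ceil(batch_size)
    }
}

fn main() {
    // 10 items with batch size 3: 4 batches normally, 3 with drop_last.
    assert_eq!(expected_batches(10, 3, false), 4);
    assert_eq!(expected_batches(10, 3, true), 3);
}
```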

@hollermay (Author)

@laggui kindly review

@laggui (Member)

laggui commented Aug 1, 2025

Will take a look today!

laggui self-requested a review on August 1, 2025 at 12:28
@hollermay (Author)

Any updates?

@laggui (Member) left a comment


Sorry for the late review!

I think the solution for correctly splitting the batches in the multi-threaded data loader can be much simpler.

See my comments below 🙂

hollermay requested a review from laggui on August 9, 2025 at 17:08
@laggui (Member) left a comment


Thanks for applying the name change 🙂

I think the changes to the multithreaded dataloader .iter() are no longer required with the fixed batch split, and my previous comment on the batch strategy still applies.

@hollermay (Author)

Hey @laggui, sorry for the late reply! I was away from work for a bit; I've made the changes in strategy.rs. I think it should be resolved now.

hollermay requested a review from laggui on August 22, 2025 at 08:51
@laggui (Member) left a comment


> Hey @laggui, sorry for the late reply! I was away from work for a bit; I've made the changes in strategy.rs. I think it should be resolved now.

No problem. What about the dataloader .iter() changes (as per my last comments)? These are no longer required with #3476, I believe.

P.S.: CI failed on formatting.

@hollermay (Author)

> No problem. What about the dataloader .iter() changes (as per my last comments)? These are no longer required with #3476, I believe.
>
> P.S.: CI failed on formatting.

You're right, it's no longer required with #3476.
