Enable Sharding to Equal Sized Shards #6940

Open
yuvalkirstain opened this issue May 31, 2024 · 0 comments
Labels: enhancement (New feature or request)
yuvalkirstain commented May 31, 2024

Feature request

Add an option to shard a dataset into equally sized shards. It would be good to support both strategies: padding by duplication and truncation.

Motivation

Currently the behavior of sharding is "If n % i == l, then the first l shards will have length (n // i) + 1, and the remaining shards will have length (n // i).". However, when using FSDP we want all shards to have the same size. Users currently have to handle this manually; it would be nice to have an option to shard the dataset into equally sized shards.
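
For illustration, a minimal sketch (assuming the Hugging Face datasets library and a hypothetical 10-row toy dataset) of how shard sizes currently come out uneven:

from datasets import Dataset

# Toy example: 10 rows split into 3 shards.
dataset = Dataset.from_dict({"x": list(range(10))})
shard_sizes = [len(dataset.shard(num_shards=3, index=i)) for i in range(3)]
print(shard_sizes)  # [4, 3, 3] -- the first len(dataset) % num_shards shards get one extra example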

Your contribution

For now, just a PR. I can also add code that does what is needed, though probably not efficiently.
Shard to equal size by duplication:

from datasets import concatenate_datasets

# Pad with duplicates of the first examples so len(dataset) % num_shards == 0.
remainder = len(dataset) % num_shards
if remainder != 0:
    num_missing_examples = num_shards - remainder
    duplicated = dataset.select(range(num_missing_examples))
    dataset = concatenate_datasets([dataset, duplicated])
shard = dataset.shard(num_shards, shard_idx)

Or by truncation:

# Truncate every shard to the minimum shard length, len(dataset) // num_shards.
num_examples_per_shard = len(dataset) // num_shards
shard = dataset.shard(num_shards, shard_idx)
shard = shard.select(range(num_examples_per_shard))
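
Both snippets are rough sketches: duplication keeps every original example but repeats the first few, while truncation drops up to num_shards - 1 examples from the dataset in total.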