chore: shuffle data on the generated HF splits #74

Merged
eloy-encord merged 1 commit into main from eloy/chore-ensure-shuffle-generated-hf-splits
May 8, 2024

Conversation

eloy-encord
Contributor

Generated splits took the data in input order, which could produce skewed datasets, e.g. a split containing only one class because the last 15% of the train dataset held samples from that class alone.

Shuffling the data when generating the splits is enough to remove the dependence on input order when managing Hugging Face datasets.
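
For illustration only, here is a minimal standard-library sketch of the failure mode and the fix. It is not the code from this PR: the helper name `split_tail`, the class labels, and the seed are all hypothetical. With a Hugging Face `Dataset`, calling `Dataset.shuffle(seed=...)` before slicing would presumably play the same role as the `random.Random(seed).shuffle` call here.

```python
import random
from collections import Counter


def split_tail(labels, fraction=0.15, shuffle=True, seed=42):
    """Hypothetical helper: hold out the last `fraction` of the items.

    With shuffle=False the held-out part depends entirely on input order.
    """
    items = list(labels)
    if shuffle:
        # A seeded shuffle keeps the split reproducible across runs.
        random.Random(seed).shuffle(items)
    holdout_size = round(len(items) * fraction)
    cut = len(items) - holdout_size
    return items[:cut], items[cut:]


# Input sorted by class: the last 15% is all "dog".
labels = ["cat"] * 85 + ["dog"] * 15

ordered_train, ordered_val = split_tail(labels, shuffle=False)
shuffled_train, shuffled_val = split_tail(labels, shuffle=True)

print(Counter(ordered_val))   # Counter({'dog': 15}): a single-class split
print(Counter(shuffled_val))  # class mix no longer tied to input order
```

Without shuffling, the training part here never sees a "dog" sample and the held-out part contains nothing else. The seeded shuffle removes that dependence while keeping the split deterministic.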

@eloy-encord eloy-encord merged commit 910bf36 into main May 8, 2024
1 check passed
@eloy-encord eloy-encord deleted the eloy/chore-ensure-shuffle-generated-hf-splits branch May 8, 2024 12:03
2 participants