-
Notifications
You must be signed in to change notification settings - Fork 590
Use custom plugin to download test data early #7169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use custom plugin to download test data early #7169
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Downloading data when using pytest-xdist can lead to races where different workers see different sized datasets. We work around this problem by defining a custom plugin that runs before workers start and downloads all datasets.
test_naive_bayes.py
failuresThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving since it will resolve the immediate problem. We should keep an eye out for a more long-term solution, ideally one that fixes the root-cause within the sklearn code base.
We should reference scikit-learn/scikit-learn#32095 within the code at appropriate spots.
/merge |
Running this in CI as I can't reproduce the failure locallyThe problem we were seeing in the naive bayes tests was that some test functions saw only a subset of the dataset. As a result there was not at least one sample from every class in the dataset. The reason this happened is some kind of race in the downloading and processing of the data. This happens because we use more than one worker for
pytest-xdist
.We work around this problem by defining a custom plugin that runs before workers start and downloads all datasets.
The downside of this approach is that we need to manually list the datasets that get "pre-downloaded". I think that is Ok because we don't add new datasets frequently. But this could be improved.
An upside is that we only download each dataset once, not once per worker as we were doing so far.
More discussion and details in scikit-learn/scikit-learn#32095 - maybe it is possible to fix this at the level of scikit-learn. Which would be great as it would mean we can remove this plugin again.
xref #7152