@betatim betatim commented Sep 3, 2025

Running this in CI because I can't reproduce the failure locally.

The problem we were seeing in the naive Bayes tests was that some test functions saw only a subset of the dataset, so not every class had at least one sample. The cause appears to be a race in the downloading and processing of the data, which occurs because we run pytest-xdist with more than one worker.

We work around this problem by defining a custom plugin that runs before workers start and downloads all datasets.

The downside of this approach is that we need to manually list the datasets that get "pre-downloaded". I think that is OK because we don't add new datasets frequently, but it could be improved.

An upside is that we only download each dataset once, not once per worker as we were doing so far.
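A minimal sketch of the idea, not the code merged in this PR: it assumes the plugin hooks into `pytest_configure` and relies on pytest-xdist setting a `workerinput` attribute on the config object in worker processes only. The `DATASETS_TO_PREFETCH` registry and the `prefetch_datasets` helper are hypothetical names used for illustration.

```python
"""Sketch of a conftest.py-style hook that downloads test data once."""

# Hypothetical registry mapping a dataset name to a zero-argument callable
# that downloads (and caches) it. This is the manually maintained list.
DATASETS_TO_PREFETCH = {}


def is_xdist_worker(config):
    # pytest-xdist adds `workerinput` to the config of worker processes;
    # the controller process (or a run without xdist) does not have it.
    return hasattr(config, "workerinput")


def prefetch_datasets(config, fetchers):
    """Call every fetcher once, but only outside of xdist workers.

    Returns the names of the datasets that were fetched.
    """
    if is_xdist_worker(config):
        return []
    fetched = []
    for name, fetch in fetchers.items():
        fetch()
        fetched.append(name)
    return fetched


def pytest_configure(config):
    # Runs in the controller before the worker processes are started,
    # so workers find the data already on disk.
    prefetch_datasets(config, DATASETS_TO_PREFETCH)
```

With this shape each dataset is fetched once in the controller, and the workers skip the download step entirely because the guard returns early for them.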

More discussion and details are in scikit-learn/scikit-learn#32095. It may be possible to fix this at the level of scikit-learn, which would be great because it would mean we can remove this plugin again.

xref #7152

@betatim betatim requested review from a team as code owners September 3, 2025 08:58
@github-actions github-actions bot added the Cython / Python and ci labels Sep 3, 2025
@csadorf csadorf marked this pull request as draft September 3, 2025 15:47

copy-pr-bot bot commented Sep 3, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Downloading data when using pytest-xdist can lead to races where
different workers see different sized datasets. We work around this
problem by defining a custom plugin that runs before workers start and
downloads all datasets.
@github-actions github-actions bot removed the ci label Sep 3, 2025
@betatim betatim changed the title Debugging test_naive_bayes.py failures Use custom plugin to download test data early Sep 3, 2025
@betatim betatim marked this pull request as ready for review September 3, 2025 16:05
@csadorf csadorf added the bug and non-breaking labels Sep 3, 2025

@csadorf csadorf left a comment

Approving since it will resolve the immediate problem. We should keep an eye out for a more long-term solution, ideally one that fixes the root cause within the scikit-learn code base.

We should reference scikit-learn/scikit-learn#32095 within the code at appropriate spots.

@jameslamb jameslamb removed the request for review from AyodeAwe September 3, 2025 16:20
csadorf commented Sep 3, 2025

/merge

@rapids-bot rapids-bot bot merged commit 7180290 into rapidsai:branch-25.10 Sep 3, 2025
93 checks passed
@betatim betatim deleted the naive-bayes-debugging branch September 4, 2025 06:42