Use custom plugin to download test data early #7169

betatim · 2025-09-03T08:58:30Z

~~Running this in CI as I can't reproduce the failure locally~~

The problem we were seeing in the naive Bayes tests was that some test functions saw only a subset of the dataset, so not every class had at least one sample. This appears to be caused by a race in downloading and processing the data, which arises because we run pytest-xdist with more than one worker.
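To make the symptom concrete, here is a small sketch of a check that would flag a truncated dataset. The helper name `missing_classes` is hypothetical and not part of this PR; it only assumes that the test knows which class labels it expects.

```python
def missing_classes(labels, expected_classes):
    """Return the expected class labels that have no sample in `labels`."""
    present = set(labels)
    # Anything expected but not present indicates an incomplete dataset.
    return sorted(set(expected_classes) - present)


# Hypothetical usage inside a test: fail early with a clear message
# instead of a confusing downstream error from the estimator.
labels = [0, 0, 2, 2, 2]
absent = missing_classes(labels, expected_classes=[0, 1, 2])
if absent:
    print(f"dataset is missing samples for classes: {absent}")
```

A check like this would only detect the problem; it does not address the race itself.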

We work around this problem by defining a custom plugin that runs before workers start and downloads all datasets.
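A minimal sketch of how such a plugin could look, not the code merged in this PR. It assumes that pytest-xdist marks worker processes by setting `config.workerinput`, so the absence of that attribute identifies the controller process. The class name `PreDownloadPlugin` and the injectable `fetchers` list are hypothetical names chosen for illustration.

```python
class PreDownloadPlugin:
    """Run dataset fetchers once, in the controller process only."""

    def __init__(self, fetchers):
        # `fetchers` is an iterable of zero-argument callables, each of
        # which downloads and caches one dataset (assumption: the cache
        # is on a filesystem shared with the workers).
        self._fetchers = list(fetchers)

    def pytest_configure(self, config):
        # Assumption: under pytest-xdist, workers carry a `workerinput`
        # attribute on their config object; the controller does not.
        if hasattr(config, "workerinput"):
            return  # workers rely on the cache filled by the controller
        for fetch in self._fetchers:
            fetch()
```

In a `conftest.py`, an instance could be registered with `config.pluginmanager.register(...)` and given fetchers such as `sklearn.datasets.fetch_20newsgroups`. The exact wiring used in `python/cuml/tests/conftest.py` may differ.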

The downside of this approach is that we need to manually list the datasets that get "pre-downloaded". I think that is OK because we don't add new datasets frequently, but it could be improved.

An upside is that we now download each dataset only once, rather than once per worker as before.

More discussion and details are in scikit-learn/scikit-learn#32095. It may be possible to fix this at the level of scikit-learn, which would be great because we could then remove this plugin again.

xref #7152

copy-pr-bot · 2025-09-03T15:47:49Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

python/cuml/tests/conftest.py

csadorf

Approving since it will resolve the immediate problem. We should keep an eye out for a longer-term solution, ideally one that fixes the root cause within the sklearn code base.

We should reference scikit-learn/scikit-learn#32095 within the code at appropriate spots.

csadorf · 2025-09-03T17:19:19Z

/merge

Debug

c708507

betatim requested review from a team as code owners September 3, 2025 08:58

betatim requested review from AyodeAwe and viclafargue September 3, 2025 08:58

github-actions bot added the Cython / Python and ci labels Sep 3, 2025

github-actions bot assigned betatim Sep 3, 2025

Debug

a86359f

csadorf marked this pull request as draft September 3, 2025 15:47

Add pytest plugin to download test data before workers

ff66d56

Downloading data when using pytest-xdist can lead to races where different workers see different sized datasets. We work around this problem by defining a custom plugin that runs before workers start and downloads all datasets.

github-actions bot removed the ci label Sep 3, 2025

betatim changed the title ~~Debugging test_naive_bayes.py failures~~ Use custom plugin to download test data early Sep 3, 2025

betatim marked this pull request as ready for review September 3, 2025 16:05

csadorf reviewed Sep 3, 2025

python/cuml/tests/conftest.py

csadorf added the bug and non-breaking labels Sep 3, 2025

csadorf approved these changes Sep 3, 2025

jameslamb approved these changes Sep 3, 2025

jameslamb removed the request for review from AyodeAwe September 3, 2025 16:20

KyleFromNVIDIA approved these changes Sep 3, 2025

rapids-bot bot merged commit 7180290 into rapidsai:branch-25.10 Sep 3, 2025
93 checks passed

csadorf mentioned this pull request Sep 3, 2025

Flaky test failures due to race condition in fetch_20newsgroups with pytest-xdist #7152

Closed

jameslamb mentioned this pull request Sep 3, 2025

Build and test with CUDA 13.0.0 #7128

Merged

betatim deleted the naive-bayes-debugging branch September 4, 2025 06:42
