Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make tests more deterministic #17008

Open
wants to merge 30 commits into
base: branch-24.12
Choose a base branch
from

Conversation

galipremsagar
Copy link
Contributor

@galipremsagar galipremsagar commented Oct 7, 2024

Description

Fixes #17045

This PR removes randomness in our pytests and switches from using np.random.seed to np.random.default_rng in all of the codebase.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@github-actions github-actions bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Oct 7, 2024
@galipremsagar galipremsagar added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Oct 7, 2024
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Oct 7, 2024
@mroeschke
Copy link
Contributor

If interested, I included this pre-commit check in pandas to check that default_rng usage is always seeded (admittedly someone could still pass seed=None to get the unseeded behavior)

https://github.com/pandas-dev/pandas/blob/main/.pre-commit-config.yaml#L210-L211

@vyasr
Copy link
Contributor

vyasr commented Oct 14, 2024

I like the idea of a pygrep pre-commit check. We could also look out for np.random.seed().

@@ -245,7 +245,7 @@ def hash_vocab(
"""
Write the vocab vocabulary hashtable to the output_path
"""
np.random.seed(1243342)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VibhuJawa Switching to default_rng changes the random values generated and thus updates to python/cudf/cudf/tests/data/subword_tokenizer_data/bert_base_cased_sampled/vocab-hash.txt have been made, will that be an issue for tokenizer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think it should be , will have to test with some common vocabulary files to be sure

@galipremsagar galipremsagar marked this pull request as ready for review October 14, 2024 19:35
@galipremsagar galipremsagar requested review from a team as code owners October 14, 2024 19:35
@galipremsagar galipremsagar self-assigned this Oct 14, 2024
Copy link
Member

@charlesbluca charlesbluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are still two remaining appearance of np.random.seed in notebooks/performance-comparisons/performance-comparisons.ipynb and docs/cudf/source/user_guide/performance-comparisons/performance-comparisons.ipynb - should those also get updated?

pyproject.toml Show resolved Hide resolved
@@ -95,6 +95,18 @@ repos:
entry: 'pytest\.xfail'
language: pygrep
types: [python]
- id: no-unseeded-default-rng
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a preference to keep these checks separated? Seems like we could do something similar to the check that @mroeschke linked in consolidating all of these entries into a single check that runs against all Python files at once

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I quite like the separate entries, since the name of the entry gives some information as to what went wrong. If the regex gets complicated, I find it hard to see what's going on. But I don't have strong feelings here.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Prem! I had a few (likely nonblocking) comments, and I think I spotted a few places where the rng is still unseeded in tests. Overall this looks good though, thanks!

@@ -95,6 +95,18 @@ repos:
entry: 'pytest\.xfail'
language: pygrep
types: [python]
- id: no-unseeded-default-rng
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I quite like the separate entries, since the name of the entry gives some information as to what went wrong. If the regex gets complicated, I find it hard to see what's going on. But I don't have strong feelings here.

python/cudf/cudf/testing/dataset_generator.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_binops.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_binops.py Outdated Show resolved Hide resolved
np.random.seed(12)
rng = np.random.default_rng(seed=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it doesn't really matter, but why the change of seed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't actually matter keeping the seed same either:

In [1]: import numpy as np

In [2]: np.random.seed(0)

In [3]: np.random.randint(0, 10)
Out[3]: 5

In [4]: np.random.default_rng(0).integers(0, 10).item()
Out[4]: 8

python/cudf/cudf/utils/hash_vocab_utils.py Outdated Show resolved Hide resolved
python/cudf/cudf_pandas_tests/test_cudf_pandas.py Outdated Show resolved Hide resolved
python/cudf/cudf_pandas_tests/test_profiler.py Outdated Show resolved Hide resolved
@galipremsagar
Copy link
Contributor Author

@charlesbluca @wence- I addressed all your reviews. This should be ready for review now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.pandas Issues specific to cudf.pandas improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Make cudf tests deterministic
7 participants