
Add a new test resource to mark flaky tests #130474

Open
tomasr8 opened this issue Feb 22, 2025 · 8 comments
Labels
pending (The issue will be closed if no feedback is provided) · tests (Tests in the Lib/test dir) · type-feature (A feature request or enhancement)

Comments

@tomasr8 (Member) commented Feb 22, 2025

Feature or enhancement

Proposal:

This came up in issue #130363 and in this comment: #130363 (comment)

How about a flaky resource? That way we would at least know which tests are meant to be flaky and which are not. And for the CI, we could have it rerun flaky tests on failure so that contributors don't need to wait for a triager/core dev to rerun a failed workflow when needed.
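For illustration, a rough sketch of what guarding a test with such a resource could look like, mirroring how tests opt into existing regrtest resources such as "network" or "cpu". test.support.requires_resource exists today; the "flaky" resource itself is purely hypothetical:

```python
import unittest
from test import support


class RaceyTests(unittest.TestCase):
    # Hypothetical: "flaky" is not a regrtest resource today; this only
    # mirrors how existing resources like "network" or "cpu" are declared.
    @support.requires_resource('flaky')
    def test_heavy_threading(self):
        # A test known to fail occasionally under load; with the proposal,
        # the CI could choose to rerun tests guarded by this resource.
        self.assertTrue(True)
```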

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

@encukou (Member) commented Feb 24, 2025

I don't think we should have any tests that are meant to be flaky.
If it's OK to ignore a test, why not just delete it?

@picnixz (Member) commented Feb 24, 2025

If it's OK to ignore a test, why not just delete it?

In general, the flaky tests are those involving threading, but now I wonder if it's actually better not to have a flaky resource but rather a marker that indicates that a test can be executed multiple times on the CI until it is considered to succeed.

We found some "flaky" references, but in the end they are rather scarce. As for the argument "if it's flaky, we usually skip it": maybe it's better to remove such tests if we can't fix them (as @vstinner suggested for test_selflink).
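To make the "marker" idea above concrete, here is a minimal sketch of a rerun-until-success decorator. Nothing like this exists in test.support today; the name and retry count are purely illustrative:

```python
import functools
import unittest


def flaky(retries=3):
    """Hypothetical marker: retry a failing test a few times before giving up."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(self, *args, **kwargs)
                except Exception:
                    # Re-raise only once all attempts have been used up.
                    if attempt == retries - 1:
                        raise
        return wrapper
    return decorator


class ThreadingTests(unittest.TestCase):
    @flaky(retries=3)
    def test_under_load(self):
        # Would only fail occasionally, e.g. when the CI machine is busy.
        self.assertTrue(True)
```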

@picnixz picnixz added the pending The issue will be closed if no feedback is provided label Feb 24, 2025
@encukou (Member) commented Feb 24, 2025

now I wonder if it's actually better not to have a flaky resources but rather a marker that indicates that a test can be executed multiple times until it's considered to succeed on the CI.

Ah, that does sound useful! Basically, make --fail-rerun the default except for the tests marked "flaky"?

@picnixz (Member) commented Feb 24, 2025

Basically, make --fail-rerun the default except for the tests marked "flaky"?

Actually, not entirely. Make the tests run, and if some of them fail but are marked as flaky in the skip list (to borrow Victor's terminology), rerun them more than just once. Sometimes even a single rerun is not sufficient. In general, the CI fails on tests because the system is busy (perhaps because of other tests).

The workflow I had in my head is:

  • Run all tests normally once.
  • Rerun all failed tests normally (that's what we do now, I think?).
  • If some of the failed tests are known to be flaky, rerun them again (say 3-5 times).
  • Report the tests that really failed (namely, the non-flaky tests that failed in step 2 and the flaky tests that failed after lots of repetitions).
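A minimal sketch of that logic, assuming a hypothetical run_once(test_name) -> bool callable that runs a single test file and reports success (none of these names come from regrtest itself, and the flaky list and retry count are only placeholders):

```python
FLAKY_TESTS = {"test_threading", "test_concurrent_futures"}  # hypothetical "skip list"
FLAKY_RERUNS = 4  # roughly the 3-5 extra attempts mentioned above


def collect_real_failures(all_tests, run_once):
    # Step 1: run everything once.
    failed = [t for t in all_tests if not run_once(t)]
    # Step 2: rerun every failure once (what the CI already does).
    failed = [t for t in failed if not run_once(t)]
    really_failed = []
    for test in failed:
        if test in FLAKY_TESTS:
            # Step 3: known-flaky tests get a few extra attempts before giving up.
            if any(run_once(test) for _ in range(FLAKY_RERUNS)):
                continue
        # Step 4: anything still failing is a genuine failure worth reporting.
        really_failed.append(test)
    return really_failed
```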

Context: I'm a bit tired of hitting the "rerun" button every time and/or getting a notification telling me that an unrelated test has failed!

@vstinner (Member) commented:

Context: I'm a bit tired of hitting the "rerun" button everytime and/or having a notification telling me that an unrelated test has failed!

Which tests are failing randomly? If nobody fixes them, the situation will not improve.

@picnixz (Member) commented Feb 24, 2025

In general, the tests that I need to rerun are the test.test_concurrent_futures* or test.test_multiprocessing* ones, usually on the free-threaded build and sometimes on TSAN/ASAN.

For instance: https://github.com/python/cpython/actions/runs/13242348106/job/37023611044?pr=129175, https://github.com/python/cpython/actions/runs/13225352402/job/36915529351?pr=129175, or https://github.com/python/cpython/actions/runs/13205748354/job/36868408015?pr=129785. Agreed that those are all free-threading and/or threading related, but I'm not sure we can actually fix them all at once.

For instance, https://github.com/python/cpython/actions/runs/13205748354/job/36868408015?pr=129785#step:6:516 is a failure I've seen regularly, but maybe it can be fixed; a second issue there is that the environment seems to have changed (which then results in an overall failure, I think?).

Another example is https://github.com/python/cpython/actions/runs/13205748354/job/36868408015?pr=129785#step:6:516, where test.test_concurrent_futures.test_interpreter_pool fails. However, it could be failing because other tests are running simultaneously, or because we didn't wait long enough between two runs.

And another example is https://github.com/python/cpython/actions/runs/12927858394/job/36053953144?pr=127676#step:12:58, which is a socket test, so it could be flaky for whatever reason.


If you don't really like the idea of a skip list, I'll note which tests I rerun when I do, so that we can fix them. I should have done that sooner, but it didn't occur to me that the tests should really be fixed (since they pass most of the time, like 95% or more, I was biased and assumed that they were just flaky and that was it).

@vstinner (Member) commented:

For instance: https://github.com/python/cpython/actions/runs/13242348106/job/37023611044?pr=129175

FAIL: test_4_daemon_threads (test.test_threading.ThreadJoinOnShutdown.test_4_daemon_threads)

This test is known to be flaky: issue gh-110052. It should be skipped.

or https://github.com/python/cpython/actions/runs/13225352402/job/36915529351?pr=129175

It looks like a real bug; you should open an issue:

0:04:15 load avg: 7.63 [20/23/2] test_free_threading worker non-zero exit code (Exit code -6 (SIGABRT)) -- running (2): test_socket (1 min 38 sec), test_threading (38.8 sec)
python: Objects/obmalloc.c:1219: void process_queue(struct llist_node *, struct _qsbr_thread_state *, _Bool, delayed_dealloc_cb, void *): Assertion `buf->rd_idx == buf->wr_idx' failed.
Fatal Python error: Aborted

<Cannot show all threads while the GIL is disabled>
Stack (most recent call first):
  File "/home/runner/work/cpython/cpython/Lib/test/test_free_threading/test_dict.py", line 184 in writer_func
  File "/home/runner/work/cpython/cpython/Lib/threading.py", line 996 in run
  File "/home/runner/work/cpython/cpython/Lib/threading.py", line 1054 in _bootstrap_inner
  File "/home/runner/work/cpython/cpython/Lib/threading.py", line 1016 in _bootstrap

Extension modules: _testinternalcapi, _testcapi (total: 2)

or https://github.com/python/cpython/actions/runs/13205748354/job/36868408015?pr=129785. Agreed that those are all free-threading related and/or threading related but I'm not sure we can actually fix them at once.

test_wrapped_exception (test.test_multiprocessing_spawn.test_threads.WithThreadsTestPool.test_wrapped_exception) ... Warning -- Uncaught thread exception: ValueError
(...)
ValueError: concurrent send_bytes() calls are not supported

It looks like a bug in the test.

@picnixz (Member) commented Feb 24, 2025

For

test_wrapped_exception (test.test_multiprocessing_spawn.test_threads.WithThreadsTestPool.test_wrapped_exception) ... Warning -- Uncaught thread exception: ValueError

I'm actually unsure that that is the real error. The two tests that failed were rerun in https://github.com/python/cpython/actions/runs/13205748354/job/36868408015?pr=129785#step:6:1165 and it appears that they are related to conditions:

0:27:43 load avg: 0.00 Re-running 2 failed tests in verbose mode in subprocesses
0:27:43 load avg: 0.00 Run 2 tests in parallel using 2 worker processes (timeout: 10 min, worker timeout: 15 min)
0:27:49 load avg: 0.02 [1/2] test_threading passed
Re-running test_threading in verbose mode (matching: test_timeout)
test_timeout (test.test_threading.BarrierTests.test_timeout)
Test wait(timeout) ... ok
test_timeout (test.test_threading.CRLockTests.test_timeout) ... ok
test_timeout (test.test_threading.ConditionAsRLockTests.test_timeout) ... ok
test_timeout (test.test_threading.ConditionTests.test_timeout) ... ok
test_timeout (test.test_threading.EventTests.test_timeout) ... ok
test_timeout (test.test_threading.LockTests.test_timeout) ... ok
test_timeout (test.test_threading.PyRLockTests.test_timeout) ... ok

For the error you linked, I think it's a warning emission (https://github.com/python/cpython/actions/runs/13205748354/job/36868408015?pr=129785#step:6:1258):

test_wrapped_exception (test.test_multiprocessing_spawn.test_threads.WithThreadsTestPool.test_wrapped_exception) ... Warning -- Uncaught thread exception: ValueError

I'll try to remember to open an issue for python/cpython/actions/runs/13225352402/job/36915529351?pr=129175
