test_free_threading.test_dict on TSAN/free-threading is flaky #130519

Open
picnixz opened this issue Feb 24, 2025 · 4 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) tests Tests in the Lib/test dir topic-free-threading type-bug An unexpected behavior, bug, or error

Comments

@picnixz
Member

picnixz commented Feb 24, 2025

Bug report

Bug description:

We have:

0:04:15 load avg: 7.63 [20/23/2] test_free_threading worker non-zero exit code (Exit code -6 (SIGABRT)) -- running (2): test_socket (1 min 38 sec), test_threading (38.8 sec)
python: Objects/obmalloc.c:1219: void process_queue(struct llist_node *, struct _qsbr_thread_state *, _Bool, delayed_dealloc_cb, void *): Assertion `buf->rd_idx == buf->wr_idx' failed.
Fatal Python error: Aborted

<Cannot show all threads while the GIL is disabled>
Stack (most recent call first):
  File "/home/runner/work/cpython/cpython/Lib/test/test_free_threading/test_dict.py", line 184 in writer_func
  File "/home/runner/work/cpython/cpython/Lib/threading.py", line 996 in run
  File "/home/runner/work/cpython/cpython/Lib/threading.py", line 1054 in _bootstrap_inner
  File "/home/runner/work/cpython/cpython/Lib/threading.py", line 1016 in _bootstrap

Extension modules: _testinternalcapi, _testcapi (total: 2)

I'm not sure whether this is a real bug. Victor suggested that I open an issue for it, though one may already exist.

See https://github.com/python/cpython/actions/runs/13225352402/job/36915529351?pr=129175#step:12:44 for the log (hopefully it will stay).

CPython versions tested on:

CPython main branch

Operating systems tested on:

No response

Linked PRs

@picnixz picnixz added interpreter-core (Objects, Python, Grammar, and Parser dirs) tests Tests in the Lib/test dir topic-free-threading type-bug An unexpected behavior, bug, or error labels Feb 24, 2025
@sergey-miryanov
Contributor

I'm a bit stuck with my last PR. Do you want me to look into this?

@picnixz
Member Author

picnixz commented Feb 24, 2025

You can investigate if you want but maybe free-threading experts have an idea of why it fails.

cc @ZeroIntensity, maybe you've encountered this before?

@ZeroIntensity
Member

I haven't seen it. I think there's a thread somewhere where you're supposed to report flaky TSan failures.

@colesbury
Contributor

colesbury commented Feb 25, 2025

This is a good place to report it, thanks.

I think the problem is that free_work_item can be re-entrant because it calls Py_DECREF(). Originally, the QSBR code just made calls to PyMem_Free() or PyObject_Free(), which are never re-entrant. In:

We extended it to also execute Py_DECREF() on objects, which can run arbitrary code via destructors, including re-entering QSBR (possibly during a GC).

This invalidates some assumptions. For example, the head of the work queue can change during a call to free_work_item:

cpython/Objects/obmalloc.c

Lines 1258 to 1268 in c5f925c

    struct _mem_work_chunk *buf = work_queue_first(head);
    while (buf->rd_idx < buf->wr_idx) {
        struct _mem_work_item *item = &buf->array[buf->rd_idx];
        if (!_Py_qsbr_poll(qsbr, item->qsbr_goal)) {
            return;
        }
        free_work_item(item->ptr, cb, state);
        buf->rd_idx++;
    }
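
To make the hazard concrete, here is a toy model of the loop above. It is only a sketch under assumptions, not CPython's code: `struct chunk`, `free_item`, `run`, and both `process_*` functions are hypothetical names, the real queue is a linked list of chunks, and the QSBR poll is left out. The "destructor" of item 0 re-enters the processing loop once. With the free-then-advance order, the nested call sees the same `rd_idx` and processes item 0 again, and the outer call then increments `rd_idx` past `wr_idx`. Advancing `rd_idx` before running the destructor is one possible reordering that avoids both problems in this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_CAPACITY 8

/* Simplified stand-in for one work chunk; NOT CPython's real struct. */
struct chunk {
    size_t rd_idx;                    /* next item to process */
    size_t wr_idx;                    /* one past the last queued item */
    int freed[CHUNK_CAPACITY];        /* how many times each item was "freed" */
    bool reenter_pending;             /* item 0's destructor re-enters once */
    void (*process)(struct chunk *);  /* loop that the destructor re-enters */
};

/* Models free_work_item(): the "destructor" of item 0 re-enters the queue. */
static void free_item(struct chunk *c, size_t idx)
{
    c->freed[idx]++;
    if (idx == 0 && c->reenter_pending) {
        c->reenter_pending = false;
        c->process(c);  /* arbitrary code ends up back in queue processing */
    }
}

/* Same order as the quoted loop: free first, advance rd_idx afterwards. */
void process_free_then_advance(struct chunk *c)
{
    while (c->rd_idx < c->wr_idx) {
        free_item(c, c->rd_idx);
        c->rd_idx++;
    }
}

/* Reordered: claim the item by advancing rd_idx, then run the destructor. */
void process_advance_then_free(struct chunk *c)
{
    while (c->rd_idx < c->wr_idx) {
        size_t idx = c->rd_idx++;
        free_item(c, idx);
    }
}

/* Queue three items and drain them with the given loop. */
struct chunk run(void (*process)(struct chunk *))
{
    struct chunk c = {0};
    c.wr_idx = 3;
    assert(c.wr_idx <= CHUNK_CAPACITY);
    c.reenter_pending = true;
    c.process = process;
    process(&c);
    return c;
}
```

In this model, `run(process_free_then_advance)` ends with `rd_idx == 4` while `wr_idx == 3`, and item 0 is freed twice, so an invariant like `rd_idx == wr_idx` no longer holds. `run(process_advance_then_free)` ends with `rd_idx == wr_idx` and every item freed exactly once.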

cc @DinoV

@colesbury colesbury self-assigned this Feb 25, 2025
colesbury added a commit to colesbury/cpython that referenced this issue Feb 25, 2025
The `free_work_item()` function in QSBR may call arbitrary code via
Python object destructors, which may reenter the QSBR code. Reorder
the processing of work items to be robust to reentrancy.

Also fix the TODO for the out of memory situation.