Skip to content

Conversation

@achimnol
Copy link
Member

@achimnol achimnol commented Jul 27, 2025

Summary

Fixes segfaults/SIGABRT during Python interpreter shutdown caused by race conditions with orphaned tokio tasks.

Results:

  • 100% success rate across all Python versions (3.11-3.14t) and platforms
  • ✅ Upstream PR submitted: pyo3-async-runtimes#71

The Problem

From pyo3-async-runtimes#40:

  1. pyo3-async-runtimes stores runtime in OnceCell<Runtime> with no cleanup mechanism
  2. Tokio tasks remain queued after Python finalization begins
  3. Tasks attempt PyGILState_Release on finalizing interpreter → SIGABRT

Solution

We forked pyo3-async-runtimes to implement a dedicated runtime thread pattern (inspired by valkey-glide) that enables explicit shutdown coordination.

How It Works

The tokio runtime now lives in a dedicated thread. When cleanup_runtime() is called:

  1. Signals the runtime thread via Notify
  2. Runtime thread calls Runtime::shutdown_timeout(5s)
  3. cleanup_runtime() blocks until the runtime fully terminates
  4. Python finalization can now proceed safely

Usage

import asyncio
from etcd_client import Client, cleanup_runtime

async def main():
    async with Client(["localhost:2379"]) as client:
        await client.put("key", "value")
    
    # Cleanup BEFORE event loop shutdown
    cleanup_runtime()

asyncio.run(main())

Changes

pyo3-async-runtimes (forked)

  • Added RuntimeWrapper with dedicated runtime thread
  • Added request_shutdown(timeout_ms) for graceful shutdown
  • Added get_handle() as recommended API (deprecates get_runtime())

etcd-client-py

  • src/runtime.rs: Simplified to just cleanup_runtime() function
  • src/*.rs: Use pyo3_async_runtimes::tokio::future_into_py directly
  • tests/: Simplified stress tests, removed cross-library tests

Test Results

Platform Python 3.11 Python 3.12 Python 3.13 Python 3.14 Python 3.14t
Linux x86_64
Linux ARM64
macOS

Related

@achimnol achimnol force-pushed the topic/fix-crash-upon-python-shutdown branch from 8725525 to 9a118f8 Compare July 27, 2025 04:50
@achimnol
Copy link
Member Author

achimnol commented Jul 27, 2025

The test in GitHub Action passes, but it fails with SIGABRT or SIGSEGV in local environments.

@achimnol achimnol marked this pull request as ready for review January 3, 2026 15:32
Register _cleanup_runtime() with atexit in all subprocess test scripts
to ensure the tokio runtime is properly shut down before Python interpreter
exits. This prevents race conditions between background tasks and GIL
cleanup during interpreter shutdown.

The tests affected:
- test_shutdown_stress.py: All subprocess stress tests
- test_cross_library_shutdown.py: Cross-library and baseline tests

This change makes the tests explicitly wait for runtime shutdown,
matching the behavior of the production atexit handler.
- Added pyo3-async-runtimes as submodule with shutdown API patches
- Patched tokio.rs to add task tracking and shutdown coordination
- Updated Cargo.toml to use patched local submodule
- Rewrote src/runtime.rs to use new shutdown APIs
- Added Python wrapper to export _cleanup_runtime
- Configured maturin for mixed Python/Rust builds
- All tests pass (18 passed, 2 skipped)

This provides explicit control over tokio runtime shutdown,
preventing GIL state violations during Python interpreter exit.

Related: BA-1976, pyo3-async-runtimes#40
The pyo3-async-runtimes submodule was not being checked out during CI runs,
causing build failures. Added submodules: recursive to the checkout action.
- Forked pyo3-async-runtimes to lablup/pyo3-async-runtimes
- Created branch 'add-explicit-shutdown-api' with shutdown patches
- Updated submodule to point to lablup fork
- Submodule now at commit b8eb152 with explicit shutdown APIs

This allows us to maintain the patches in our own fork while
staying synchronized with upstream pyo3-async-runtimes.
Changed test scripts to call _cleanup_runtime() after asyncio.run()
instead of using atexit handlers. This tests whether explicit cleanup
after event loop shutdown but before process exit resolves the ARM64
Python 3.11-3.12 race condition (currently 1/20 iterations with SIGSEGV).

Changes:
- test_shutdown_stress.py: Updated _make_test_script() and all test scripts
- test_cross_library_shutdown.py: Updated script generation functions
- Removed unused asyncio import from test_shutdown_stress.py

Pattern: asyncio.run(main()) followed by _cleanup_runtime()
…side async functions

This corrects the timing of cleanup to happen BEFORE the event loop shuts down,
not after. The cleanup_runtime() function should be called at the end of the
main async function, while the event loop is still running.

Changes:
- Renamed _cleanup_runtime() to cleanup_runtime() as it's a public API
- Updated all test scripts to call cleanup_runtime() inside async functions
- Updated docstrings to reflect correct usage pattern
- Updated type stub with new function signature and documentation

Correct pattern:
```python
async def main():
    # Your etcd operations
    ...
    # Cleanup before returning
    cleanup_runtime()

asyncio.run(main())  # Event loop shutdown happens after cleanup
```
Updated pyo3-async-runtimes submodule to use tokio's built-in Runtime::shutdown_timeout()
for proper shutdown, removing manual task counting.

Changes in submodule (vendor/pyo3-async-runtimes):
- Removed manual ACTIVE_TASKS tracking
- Store runtime in Arc<Runtime> to allow taking ownership
- Use Arc::try_unwrap() + shutdown_timeout() for graceful shutdown
- Removed obsolete functions: task_started(), task_completed(), get_active_task_count()

Changes in parent repo (src/runtime.rs):
- Removed task tracking wrapper from spawn()
- Simplified Drop implementation
- Updated cleanup_runtime() to reflect tokio shutdown usage
- Updated documentation

This uses tokio's native task management instead of manual tracking.
Updates submodule to implement valkey-glide-style shutdown pattern with:
- TokioRuntimeWrapper using dedicated thread and Handle-based access
- AsyncStdRuntimeWrapper for API consistency
- New spawn/spawn_blocking/request_shutdown APIs
- Graceful shutdown via Drop-based cleanup

This enables clean runtime shutdown and should reduce segfaults during
Python finalization.
Updates pyo3-async-runtimes submodule to fix race condition where
request_shutdown() returned before the runtime thread completed.

## Key Changes in Submodule

- Runtime thread now calls shutdown_timeout() explicitly
- request_shutdown() blocks until thread.join() completes
- Uses channel to pass timeout value to runtime thread
- Interior mutability via Mutex for thread handle

## Result

- No more "shutdown completed" message before actual shutdown
- No more GIL state violations from async shutdown
- Tests pass reliably (18 passed, 2 skipped, 0 failed)
- Proper synchronous blocking shutdown behavior
Refactored code for upstream PR review:
- Add section comments for better code organization
- Remove debug eprintln! statements
- Add get_handle() as the recommended public API
- Improve documentation for all public functions
- Mark get_runtime() as deprecated with migration guidance

PR: PyO3/pyo3-async-runtimes#71
Since pyo3-async-runtimes now has built-in shutdown support, we no longer
need the EtcdRt wrapper struct. This commit:

- Removes EtcdRt struct, keeping only cleanup_runtime() function
- Calls pyo3_async_runtimes::tokio::future_into_py directly
- Removes unnecessary cross-library shutdown test (valkey-glide uses
  different architecture - process isolation, not embedded runtime)
- Simplifies stress tests to essential shutdown scenarios
@achimnol achimnol merged commit bd9e7f1 into main Jan 5, 2026
17 checks passed
@achimnol achimnol deleted the topic/fix-crash-upon-python-shutdown branch January 5, 2026 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants