docs: Expand docs on when and why allow_threads is necessary #4767

Open
wants to merge 9 commits into base: main
42 changes: 31 additions & 11 deletions guide/src/free-threading.md
@@ -156,20 +156,40 @@ freethreaded build, holding a `'py` lifetime means only that the thread is
currently attached to the Python interpreter -- other threads can be
simultaneously interacting with the interpreter.

You still need to obtain a `'py` lifetime to interact with Python
objects or call into the CPython C API. If you are not yet attached to the
Python runtime, you can register a thread using the [`Python::with_gil`]
function. Threads created via the Python [`threading`] module do not need to
do this, and PyO3 will handle setting up the [`Python<'py>`] token when CPython
calls into your extension.
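As a minimal sketch of the explicit case: an OS thread spawned from Rust is not known to CPython, so it must attach itself before touching the interpreter (the `py.version()` call is just a placeholder for real work):

```rust,no_run
use pyo3::prelude::*;

// A thread created with `std::thread::spawn` (rather than Python's
// `threading` module) must attach to the interpreter itself via
// `Python::with_gil` before using any Python APIs.
std::thread::spawn(|| {
    Python::with_gil(|py| {
        // Attached: it is now safe to call into the CPython runtime.
        println!("interpreter version: {}", py.version());
    });
})
.join()
.unwrap();
```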

### Global synchronization events can cause hangs and deadlocks

The free-threaded build triggers global synchronization events in the following
situations:

* During garbage collection in order to get a globally consistent view of
reference counts and references between objects
* In Python 3.13, when the first background thread is started in
order to mark certain objects as immortal
* When either `sys.settrace` or `sys.setprofile` is called in order to
instrument running code objects and threads
* Before `os.fork()` is called

This is a non-exhaustive list and there may be other situations in future Python
versions that can trigger global synchronization events.

This means that you should detach from the interpreter runtime using
[`Python::allow_threads`] in exactly the same situations as in the GIL-enabled
build: when doing long-running tasks that do not require the CPython runtime, or
when doing any task that needs to re-attach to the runtime (see the [guide
section](parallelism.md#sharing-python-objects-between-rust-threads) that covers
this). In the former case, threads waiting on the long-running task to complete
would hang; in the latter case, a thread trying to attach after the runtime
triggers a global synchronization event would deadlock, because the spawning
thread prevents the synchronization event from completing.

### Exceptions and panics for multithreaded access of mutable `pyclass` instances

59 changes: 58 additions & 1 deletion guide/src/parallelism.md
@@ -1,6 +1,6 @@
# Parallelism

CPython has the infamous [Global Interpreter Lock](https://docs.python.org/3/glossary.html#term-global-interpreter-lock) (GIL), which prevents several threads from executing Python bytecode in parallel. This makes threading in Python a bad fit for [CPU-bound](https://en.wikipedia.org/wiki/CPU-bound) tasks and often forces developers to accept the overhead of multiprocessing. There is an experimental "free-threaded" version of CPython 3.13 that does not have a GIL; see the PyO3 docs on [free-threaded Python](./free-threading.md) for more information.

In PyO3 parallelism can be easily achieved in Rust-only code. Let's take a look at our [word-count](https://github.com/PyO3/pyo3/blob/main/examples/word-count/src/lib.rs) example, where we have a `search` function that utilizes the [rayon](https://github.com/rayon-rs/rayon) crate to count words in parallel.
```rust,no_run
@@ -117,4 +117,61 @@ test_word_count_python_sequential 27.3985 (15.82) 45.452

You can see that the Python threaded version is not much slower than the Rust sequential version, which means the speed doubled compared to execution on a single CPU core.

## Sharing Python objects between Rust threads

In the example above we made a Python interface to a low-level Rust function
and then leveraged the Python `threading` module to run the low-level function
in parallel. It is also possible to spawn threads in Rust that acquire the GIL
and operate on Python objects. However, care must be taken to avoid writing code
that deadlocks with the GIL in these cases.

* Note: This example is meant to illustrate how to drop and re-acquire the GIL
  to avoid creating deadlocks. Using `rayon` to parallelize code that acquires
  and holds the GIL for the entire execution of each spawned thread will not
  produce any multi-threaded speedup unless the spawned threads subsequently
  release the GIL or you are using the free-threaded build of CPython.

In the example below, we share a `Vec` of User ID objects defined using the
`pyclass` macro and spawn threads to process the collection of data into a `Vec`
of booleans based on a predicate using a rayon parallel iterator:

```rust,no_run
use pyo3::prelude::*;

// These traits let us use `par_iter` and `map`
use rayon::iter::{IntoParallelRefIterator, ParallelIterator};

#[pyclass]
struct UserID {
    id: i64,
}

let allowed_ids: Vec<bool> = Python::with_gil(|outer_py| {
    let instances: Vec<Py<UserID>> = (0..10)
        .map(|x| Py::new(outer_py, UserID { id: x }).unwrap())
        .collect();
    outer_py.allow_threads(|| {
        instances
            .par_iter()
            .map(|instance| Python::with_gil(|inner_py| instance.borrow(inner_py).id > 5))
            .collect()
    })
});
assert!(allowed_ids.into_iter().filter(|b| *b).count() == 4);
```

It's important to note that there is an `outer_py` GIL lifetime token as well as
an `inner_py` token. Sharing GIL lifetime tokens between threads is not allowed,
and each thread must individually acquire the GIL to access data wrapped by a
Python object.

It's also important to note that this example uses [`Python::allow_threads`] to
wrap the code that spawns OS threads via `rayon`. If this example didn't use
`allow_threads`, a rayon worker thread would block on acquiring the GIL while
the thread that owns the GIL spins forever waiting for the result of the rayon
thread. Calling `allow_threads` releases the GIL in the thread collecting the
results from the worker threads. To prevent deadlocks, you should always call
`allow_threads` in situations that spawn worker threads, especially when those
worker threads need to acquire the GIL.
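To make the failure mode concrete, here is a sketch of the broken variant (the function name `deadlocks` is illustrative; do not run this on a GIL-enabled build):

```rust,no_run
use pyo3::prelude::*;
use rayon::iter::{IntoParallelIterator, ParallelIterator};

// Anti-pattern: the spawning thread still holds the GIL inside the outer
// `with_gil` closure, so on a GIL-enabled build every rayon worker blocks
// forever in its own `with_gil` call and `collect` never returns.
fn deadlocks() -> Vec<bool> {
    Python::with_gil(|_outer_py| {
        // Missing `allow_threads` around the parallel iterator!
        (0..10)
            .into_par_iter()
            .map(|x| Python::with_gil(|_inner_py| x > 5))
            .collect()
    })
}
```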

[`Python::allow_threads`]: {{#PYO3_DOCS_URL}}/pyo3/marker/struct.Python.html#method.allow_threads