From f320d05af52035d0263d2fe947870fffe49390e9 Mon Sep 17 00:00:00 2001
From: Nathan Goldbaum <nathan.goldbaum@gmail.com>
Date: Thu, 5 Dec 2024 11:04:59 -0700
Subject: [PATCH 1/9] Expand docs on when and why allow_threads is necessary

---
 guide/src/free-threading.md | 42 ++++++++++++++++++++++-------
 guide/src/parallelism.md    | 54 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 85 insertions(+), 11 deletions(-)
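
Note: the deadlock that the new "Sharing Python objects between Rust threads"
section describes can be reproduced without PyO3. The sketch below is only an
analogy under stated assumptions: `FAKE_GIL` and `allowed_ids` are hypothetical
names, a plain `std::sync::Mutex` stands in for the interpreter lock, and
`std::thread::scope` stands in for the rayon thread pool. It is not how PyO3
implements `allow_threads`.

```rust
use std::sync::Mutex;
use std::thread;

// Stand-in for the interpreter lock. This is NOT a PyO3 type.
static FAKE_GIL: Mutex<()> = Mutex::new(());

/// Hypothetical helper: evaluates `id > 5` for each id on its own thread.
/// Every worker takes the lock before doing its work.
fn allowed_ids(ids: &[i64]) -> Vec<bool> {
    // The calling thread starts out holding the lock, like code inside `with_gil`.
    let outer = FAKE_GIL.lock().unwrap();

    // Analogue of `allow_threads`: release the lock before waiting on workers.
    // Joining while still holding `outer` would block forever, because each
    // worker below needs the same lock to make progress.
    drop(outer);

    let results: Vec<bool> = thread::scope(|s| {
        let handles: Vec<_> = ids
            .iter()
            .map(|&id| {
                s.spawn(move || {
                    // Analogue of the inner `with_gil` call in a worker thread.
                    let _inner = FAKE_GIL.lock().unwrap();
                    id > 5
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Re-acquire the lock, as happens when the `allow_threads` closure returns.
    let _outer = FAKE_GIL.lock().unwrap();
    results
}
```

Removing the `drop(outer)` line makes the function hang on the first `join`,
which is the same shape of bug the guide text warns about.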

diff --git a/guide/src/free-threading.md b/guide/src/free-threading.md
index f212cb0b9a9..7179c9e08fb 100644
--- a/guide/src/free-threading.md
+++ b/guide/src/free-threading.md
@@ -160,16 +160,38 @@ The main reason for obtaining a `'py` lifetime is to interact with Python
 objects or call into the CPython C API. If you are not yet attached to the
 Python runtime, you can register a thread using the [`Python::with_gil`]
 function. Threads created via the Python [`threading`] module do not not need to
-do this, but all other OS threads that interact with the Python runtime must
-explicitly attach using `with_gil` and obtain a `'py` liftime.
-
-Since there is no GIL in the free-threaded build, releasing the GIL for
-long-running tasks is no longer necessary to ensure other threads run, but you
-should still detach from the interpreter runtime using [`Python::allow_threads`]
-when doing long-running tasks that do not require the CPython runtime. The
-garbage collector can only run if all threads are detached from the runtime (in
-a stop-the-world state), so detaching from the runtime allows freeing unused
-memory.
+do this, and pyo3 will handle setting up the [`Python<'py>`] token when CPython
+calls into your extension, but all other OS threads that interact with the
+Python runtime must explicitly attach using `with_gil` and obtain a `'py`
+liftime.
+
+### Global synchronization events can cause hangs and deadlocks
+
+The free-threaded build triggers global synchronization events in the following
+situations:
+
+* During garbage collection in order to get a globally consistent view of
+  reference counts and references between objects
+* In Python 3.13, when the first background thread is started in
+  order to mark certain objects as immortal
+* When either `sys.settrace` or `sys.setprofile` is called in order to
+  instrument running code objects and threads
+* Before `os.fork()` is called.
+
+This is a non-exhaustive list and there may be other situations in future Python
+versions that can trigger global synchronization events.
+
+This means that you should detach from the interpreter runtime using
+[`Python::allow_threads`] in exactly the same situations as you should detach
+from the runtime in the GIL-enabled build: when doing long-running tasks that do
+not require the CPython runtime or when doing any task that needs to re-attach
+to the runtime (see the [guide
+section](parallelism.md#sharing-python-objects-between-rust-threads) that
+covers this). In the former case, threads waiting for a global synchronization
+event hang until the long-running task completes. In the latter case, you get
+a deadlock: a spawned thread blocks trying to attach after the runtime
+triggers a global synchronization event, while the still-attached spawning
+thread waits on it and so prevents the synchronization event from completing.
 
 ### Exceptions and panics for multithreaded access of mutable `pyclass` instances
 
diff --git a/guide/src/parallelism.md b/guide/src/parallelism.md
index a288b14be19..eef396afa70 100644
--- a/guide/src/parallelism.md
+++ b/guide/src/parallelism.md
@@ -1,6 +1,6 @@
 # Parallelism
 
-CPython has the infamous [Global Interpreter Lock](https://docs.python.org/3/glossary.html#term-global-interpreter-lock), which prevents several threads from executing Python bytecode in parallel. This makes threading in Python a bad fit for [CPU-bound](https://en.wikipedia.org/wiki/CPU-bound) tasks and often forces developers to accept the overhead of multiprocessing.
+CPython has the infamous [Global Interpreter Lock](https://docs.python.org/3/glossary.html#term-global-interpreter-lock) (GIL), which prevents several threads from executing Python bytecode in parallel. This makes threading in Python a bad fit for [CPU-bound](https://en.wikipedia.org/wiki/CPU-bound) tasks and often forces developers to accept the overhead of multiprocessing. There is an experimental "free-threaded" version of CPython 3.13 that does not have a GIL; see the PyO3 docs on [free-threaded Python](./free-threading.md) for more information.
 
 In PyO3 parallelism can be easily achieved in Rust-only code. Let's take a look at our [word-count](https://github.com/PyO3/pyo3/blob/main/examples/word-count/src/lib.rs) example, where we have a `search` function that utilizes the [rayon](https://github.com/rayon-rs/rayon) crate to count words in parallel.
 ```rust,no_run
@@ -117,4 +117,56 @@ test_word_count_python_sequential                      27.3985 (15.82)    45.452
 
 You can see that the Python threaded version is not much slower than the Rust sequential version, which means compared to an execution on a single CPU core the speed has doubled.
 
+## Sharing Python objects between Rust threads
+
+In the example above we made a Python interface to a low-level Rust function,
+and then leveraged the Python `threading` module to run the low-level function
+in parallel. It is also possible to spawn threads in Rust that acquire the GIL
+and operate on Python objects. However, care must be taken to avoid writing code
+that deadlocks with the GIL in these cases.
+
+In the example below, we share a `vec` of User ID objects defined using the
+`pyclass` macro and spawn threads to process the collection of data into a `vec`
+of booleans based on a predicate using a rayon parallel iterator:
+
+```rust,no_run
+use pyo3::prelude::*;
+
+// These traits let us use int_par_iter and map
+use rayon::iter::{IntoParallelIterator, ParallelIterator};
+
+#[pyclass]
+struct UserID {
+    id: i64,
+}
+
+let instances: Vec<Py<UserID>> = Python::with_gil(|py| {
+    (0..10).map(|x| Py::new(py, UserID { id: x }).unwrap()).collect()
+});
+let allowed_ids: Vec<bool> = Python::with_gil(|outer_py| {
+    outer_py.allow_threads(|| {
+        (0..instances.len()).into_par_iter().map(|index| {
+            Python::with_gil(|inner_py| {
+                instances[index].borrow(inner_py).id > 5
+            })
+        }).collect()
+    })
+});
+assert!(allowed_ids.into_iter().filter(|b| *b).count() == 4);
+```
+
+It's important to note that there is an `outer_py` GIL lifetime token as well as
+an `inner_py` token. Sharing GIL lifetime tokens between threads is not allowed
+and threads must individually acquire the GIL to access data wrapped by a Python
+object.
+
+It's also important to see that this example uses [`Python::allow_threads`] to
+wrap the code that spawns OS threads via `rayon`. If this example didn't use
+`allow_threads`, a rayon worker thread would block on acquiring the GIL while a
+thread that owns the GIL spins forever waiting for the result of the rayon
+thread. Calling `allow_threads` allows the GIL to be released in the thread
+collecting the results from the worker threads. You should always call
+`allow_threads` in situations that spawn worker threads, but especially so in
+cases where worker threads need to acquire the GIL to prevent deadlocks.
+
 [`Python::allow_threads`]: {{#PYO3_DOCS_URL}}/pyo3/marker/struct.Python.html#method.allow_threads

From 4eee1e42ad81c375feae554c12601bd72ab03779 Mon Sep 17 00:00:00 2001
From: Nathan Goldbaum <nathan.goldbaum@gmail.com>
Date: Thu, 5 Dec 2024 11:15:19 -0700
Subject: [PATCH 2/9] spelling

---
 guide/src/parallelism.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/guide/src/parallelism.md b/guide/src/parallelism.md
index eef396afa70..5ccbb2d0ed5 100644
--- a/guide/src/parallelism.md
+++ b/guide/src/parallelism.md
@@ -125,8 +125,8 @@ in parallel. It is also possible to spawn threads in Rust that acquire the GIL
 and operate on Python objects. However, care must be taken to avoid writing code
 that deadlocks with the GIL in these cases.
 
-In the example below, we share a `vec` of User ID objects defined using the
-`pyclass` macro and spawn threads to process the collection of data into a `vec`
+In the example below, we share a `Vec` of User ID objects defined using the
+`pyclass` macro and spawn threads to process the collection of data into a `Vec`
 of booleans based on a predicate using a rayon parallel iterator:
 
 ```rust,no_run

From 8ae85e67cad03f3675d29bee2dbd7bc7b0114f80 Mon Sep 17 00:00:00 2001
From: Nathan Goldbaum <nathan.goldbaum@gmail.com>
Date: Thu, 5 Dec 2024 11:18:16 -0700
Subject: [PATCH 3/9] simplify example a little

---
 guide/src/parallelism.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/guide/src/parallelism.md b/guide/src/parallelism.md
index 5ccbb2d0ed5..2629e17b611 100644
--- a/guide/src/parallelism.md
+++ b/guide/src/parallelism.md
@@ -140,10 +140,8 @@ struct UserID {
     id: i64,
 }
 
-let instances: Vec<Py<UserID>> = Python::with_gil(|py| {
-    (0..10).map(|x| Py::new(py, UserID { id: x }).unwrap()).collect()
-});
 let allowed_ids: Vec<bool> = Python::with_gil(|outer_py| {
+    let instances: Vec<Py<UserID>> = (0..10).map(|x| Py::new(outer_py, UserID { id: x }).unwrap()).collect();
     outer_py.allow_threads(|| {
         (0..instances.len()).into_par_iter().map(|index| {
             Python::with_gil(|inner_py| {

From a70698ee3ce54e8622934e507f3a820bfdc241d7 Mon Sep 17 00:00:00 2001
From: Nathan Goldbaum <nathan.goldbaum@gmail.com>
Date: Thu, 5 Dec 2024 11:21:27 -0700
Subject: [PATCH 4/9] use less indirection in the example

---
 guide/src/parallelism.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/guide/src/parallelism.md b/guide/src/parallelism.md
index 2629e17b611..fb8e0c25a16 100644
--- a/guide/src/parallelism.md
+++ b/guide/src/parallelism.md
@@ -133,7 +133,7 @@ of booleans based on a predicate using a rayon parallel iterator:
 use pyo3::prelude::*;
 
 // These traits let us use int_par_iter and map
-use rayon::iter::{IntoParallelIterator, ParallelIterator};
+use rayon::iter::{IntoParallelRefIterator, ParallelIterator};
 
 #[pyclass]
 struct UserID {
@@ -143,9 +143,9 @@ struct UserID {
 let allowed_ids: Vec<bool> = Python::with_gil(|outer_py| {
     let instances: Vec<Py<UserID>> = (0..10).map(|x| Py::new(outer_py, UserID { id: x }).unwrap()).collect();
     outer_py.allow_threads(|| {
-        (0..instances.len()).into_par_iter().map(|index| {
+        instances.par_iter().map(|instance| {
             Python::with_gil(|inner_py| {
-                instances[index].borrow(inner_py).id > 5
+                instance.borrow(inner_py).id > 5
             })
         }).collect()
     })

From a5ace87c7b6e0635f9a47706801af766f5ce6e9b Mon Sep 17 00:00:00 2001
From: Nathan Goldbaum <nathan.goldbaum@gmail.com>
Date: Thu, 5 Dec 2024 12:27:42 -0700
Subject: [PATCH 5/9] Update guide/src/parallelism.md

---
 guide/src/parallelism.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/guide/src/parallelism.md b/guide/src/parallelism.md
index fb8e0c25a16..4f567765060 100644
--- a/guide/src/parallelism.md
+++ b/guide/src/parallelism.md
@@ -165,6 +165,6 @@ thread that owns the GIL spins forever waiting for the result of the rayon
 thread. Calling `allow_threads` allows the GIL to be released in the thread
 collecting the results from the worker threads. You should always call
 `allow_threads` in situations that spawn worker threads, but especially so in
-cases where worker threads need to acquire the GIL to prevent deadlocks.
+cases where worker threads need to acquire the GIL, to prevent deadlocks.
 
 [`Python::allow_threads`]: {{#PYO3_DOCS_URL}}/pyo3/marker/struct.Python.html#method.allow_threads

From b71aceff1aa506f983a01c4f72630a5e91cb0a4b Mon Sep 17 00:00:00 2001
From: Nathan Goldbaum <nathan.goldbaum@gmail.com>
Date: Wed, 11 Dec 2024 09:35:44 -0700
Subject: [PATCH 6/9] Add note about the GIL preventing parallelism

---
 guide/src/parallelism.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/guide/src/parallelism.md b/guide/src/parallelism.md
index 4f567765060..ed366abd759 100644
--- a/guide/src/parallelism.md
+++ b/guide/src/parallelism.md
@@ -129,6 +129,12 @@ In the example below, we share a `Vec` of User ID objects defined using the
 `pyclass` macro and spawn threads to process the collection of data into a `Vec`
 of booleans based on a predicate using a rayon parallel iterator:
 
+* Note: This example is meant to illustrate how to drop and re-acquire the GIL
+        to avoid creating deadlocks. Unless the spawned threads subsequently
+        release the GIL or you are using the free-threaded build of CPython, you
+        will not see any speedups due to multi-threaded parallelism using `rayon`
+        to parallelize code that acquires the GIL.
+
 ```rust,no_run
 use pyo3::prelude::*;
 

From 2fc22e95bbd5ce0f73124263d8574513853cd8b4 Mon Sep 17 00:00:00 2001
From: Nathan Goldbaum <nathan.goldbaum@gmail.com>
Date: Fri, 13 Dec 2024 10:20:55 -0700
Subject: [PATCH 7/9] Update guide/src/free-threading.md

Co-authored-by: Bruno Kolenbrander <59372212+mejrs@users.noreply.github.com>
---
 guide/src/free-threading.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/guide/src/free-threading.md b/guide/src/free-threading.md
index 7179c9e08fb..d6d406a9a25 100644
--- a/guide/src/free-threading.md
+++ b/guide/src/free-threading.md
@@ -163,7 +163,7 @@ function. Threads created via the Python [`threading`] module do not not need to
 do this, and pyo3 will handle setting up the [`Python<'py>`] token when CPython
 calls into your extension, but all other OS threads that interact with the
 Python runtime must explicitly attach using `with_gil` and obtain a `'py`
-liftime.
+lifetime.
 
 ### Global synchronization events can cause hangs and deadlocks
 

From 6f2eb15fb0c61fb8d3f1fe5486245fc18cab8740 Mon Sep 17 00:00:00 2001
From: Nathan Goldbaum <nathan.goldbaum@gmail.com>
Date: Fri, 13 Dec 2024 10:28:34 -0700
Subject: [PATCH 8/9] pared down text about need to use with_gil

---
 guide/src/free-threading.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/guide/src/free-threading.md b/guide/src/free-threading.md
index d6d406a9a25..e3e1fa8fbf2 100644
--- a/guide/src/free-threading.md
+++ b/guide/src/free-threading.md
@@ -156,14 +156,12 @@ freethreaded build, holding a `'py` lifetime means only that the thread is
 currently attached to the Python interpreter -- other threads can be
 simultaneously interacting with the interpreter.
 
-The main reason for obtaining a `'py` lifetime is to interact with Python
+You still need to obtain a `'py` lifetime to interact with Python
 objects or call into the CPython C API. If you are not yet attached to the
 Python runtime, you can register a thread using the [`Python::with_gil`]
 function. Threads created via the Python [`threading`] module do not not need to
 do this, and pyo3 will handle setting up the [`Python<'py>`] token when CPython
-calls into your extension, but all other OS threads that interact with the
-Python runtime must explicitly attach using `with_gil` and obtain a `'py`
-lifetime.
+calls into your extension.
 
 ### Global synchronization events can cause hangs and deadlocks
 

From aaaf0a99bec7cb61c6c502bfa6a796eb319c2321 Mon Sep 17 00:00:00 2001
From: Nathan Goldbaum <nathan.goldbaum@gmail.com>
Date: Fri, 13 Dec 2024 10:54:31 -0700
Subject: [PATCH 9/9] rearrange slightly

---
 guide/src/parallelism.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/guide/src/parallelism.md b/guide/src/parallelism.md
index ed366abd759..64ff1c8c9c0 100644
--- a/guide/src/parallelism.md
+++ b/guide/src/parallelism.md
@@ -125,15 +125,16 @@ in parallel. It is also possible to spawn threads in Rust that acquire the GIL
 and operate on Python objects. However, care must be taken to avoid writing code
 that deadlocks with the GIL in these cases.
 
-In the example below, we share a `Vec` of User ID objects defined using the
-`pyclass` macro and spawn threads to process the collection of data into a `Vec`
-of booleans based on a predicate using a rayon parallel iterator:
-
 * Note: This example is meant to illustrate how to drop and re-acquire the GIL
         to avoid creating deadlocks. Unless the spawned threads subsequently
         release the GIL or you are using the free-threaded build of CPython, you
         will not see any speedups due to multi-threaded parallelism using `rayon`
-        to parallelize code that acquires the GIL.
+        to parallelize code that acquires and holds the GIL for the entire
+        execution of the spawned thread.
+
+In the example below, we share a `Vec` of User ID objects defined using the
+`pyclass` macro and spawn threads to process the collection of data into a `Vec`
+of booleans based on a predicate using a rayon parallel iterator:
 
 ```rust,no_run
 use pyo3::prelude::*;