Speed up sliding sync by computing extensions in parallel #17884
Conversation
General idea looks good 👍 Surprised we need the extra code, though.
```python
async def gather_optional_coroutines(
    *coroutines: Unpack[Tuple[Optional[Coroutine[Any, Any, T1]], ...]],
) -> Tuple[Optional[T1], ...]:
```
I'm surprised this sort of thing doesn't already exist in the API? Like `Promise.all(...)` in JavaScript.

Is it because we can't use `asyncio` yet?

If it's just a difference in how `None` is handled, can we change them from optional coroutines to something like `Promise.resolve(None)`, i.e. wrap the `None` in a coroutine (however best to do that in Python), so we're always working with a coroutine?
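For reference, the pattern being suggested here looks roughly like the following in plain `asyncio` (a hypothetical sketch for illustration only; the helper names are made up, and Synapse itself runs on Twisted, so this is not what the PR does):

```python
import asyncio
from typing import Any, Coroutine, Optional, Tuple


async def _resolve_none() -> None:
    # The asyncio equivalent of Promise.resolve(None): completes immediately.
    return None


async def gather_with_none_placeholders(
    *coroutines: Optional[Coroutine[Any, Any, Any]],
) -> Tuple[Any, ...]:
    # Replace each missing coroutine with a trivially-completed one, so that
    # asyncio.gather always receives a uniform list of awaitables.
    wrapped = [c if c is not None else _resolve_none() for c in coroutines]
    return tuple(await asyncio.gather(*wrapped))


async def main() -> None:
    async def fetch_typing() -> str:
        return "typing"

    # The second slot is disabled, so it is passed as None and comes back as None.
    results = await gather_with_none_placeholders(fetch_typing(), None)
    print(results)  # ('typing', None)


asyncio.run(main())
```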
The reason we need to do this manually (rather than using built-ins) is that we need to correctly handle log contexts.

As a quick summary: whenever we "fork off" execution of some async code, we need to use `run_in_background` (or `run_as_background_process` if we're not going to wait on the result later), and when we "join" the forks again we need to use `make_deferred_yieldable`. This is basically what `gather_optional_coroutines` does: it spawns the coroutines as background tasks and then calls `make_deferred_yieldable` when waiting for them all to complete.
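To make the fork/join pattern concrete, here is a minimal sketch of how such a helper could be built from those primitives (an illustration based on the description above, not necessarily the exact implementation in this PR; it uses `run_coroutine_in_background`, which this PR introduces, and simplifies error handling):

```python
from typing import Any, Coroutine, List, Optional, Tuple, TypeVar

from twisted.internet import defer

from synapse.logging.context import (
    make_deferred_yieldable,
    run_coroutine_in_background,
)

T = TypeVar("T")


async def gather_optional_coroutines_sketch(
    *coroutines: Optional[Coroutine[Any, Any, T]],
) -> Tuple[Optional[T], ...]:
    # "Fork": start each non-None coroutine as a background task.
    # run_coroutine_in_background wraps the coroutine in a Deferred while
    # following the logcontext rules.
    indexed = [(i, c) for i, c in enumerate(coroutines) if c is not None]
    deferreds = [run_coroutine_in_background(c) for _, c in indexed]

    # "Join": wait for all of them. make_deferred_yieldable restores our log
    # context when the gathered Deferred completes, so metrics and log lines
    # are not attributed to the sentinel context.
    gathered = await make_deferred_yieldable(
        defer.gatherResults(deferreds, consumeErrors=True)
    )

    # Slot the results back into their original positions, leaving None for
    # the coroutines that were never supplied.
    results: List[Optional[T]] = [None] * len(coroutines)
    for (i, _), value in zip(indexed, gathered):
        results[i] = value
    return tuple(results)
```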
This context would be useful to have in the docstrings. Log contexts are mentioned there, but none of the consequences are explained: why we need to do this, or why we care.
Conversation continued in #18900 (comment)
(#18964) Regressed in #18900 (comment); see the conversation there for more context.

### How is this a regression?

> To give this an update with more hindsight: this logic *was* redundant with the early return, and it is safe to remove this complexity ✅
>
> It seems like this actually has to do with completed vs incomplete deferreds...
>
> To explain how things previously worked *without* the early-return shortcut:
>
> In the normal case of an **incomplete awaitable**, we store the `calling_context` and the `f` function is called and runs until it yields to the reactor. Because `f` follows the logcontext rules, it sets the `sentinel` logcontext. Then, in `run_in_background(...)`, we restore the `calling_context`, store the current `ctx` (which is `sentinel`), and return. When the deferred completes, we restore `ctx` (which is `sentinel`) before yielding to the reactor again (all good ✅).
>
> In the other case, where we see a **completed awaitable**, we store the `calling_context` and the `f` function is called and runs to completion (no logcontext change). *This is where the shortcut would kick in, but I'm going to continue explaining as if we had commented out the shortcut.* Then, in `run_in_background(...)`, we restore the `calling_context` and store the current `ctx` (which is the same as the `calling_context`). Because the deferred is already completed, our extra callback is called immediately and we restore `ctx` (which is the same as the `calling_context`). Since we never yield to the reactor, the `calling_context` is exactly what we want again (all good ✅).
>
> ---
>
> But this also means that our early-return shortcut is no longer just an optimization: it is *necessary* for correct behaviour in the **completed awaitable** case, as we want to return with the `calling_context` and not reset to the `sentinel` context. I've updated the comment in #18964 to explain the necessity, as it's currently only described as an optimization.
>
> But because we made the same change to `run_coroutine_in_background(...)`, which didn't have the same early-return shortcut, we regressed the correct behaviour ❌. This is being fixed in #18964.
>
> *-- @MadLittleMods, #18900 (comment)*

### How did we find this problem?

Spawning from @wrjlewis [seeing](https://matrix.to/#/!SGNQGPGUwtcPBUotTL:matrix.org/$h3TxxPVlqC6BTL07dbrsz6PmaUoZxLiXnSTEY-QYDtA?via=jki.re&via=matrix.org&via=element.io) `Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost` in the logs:

<details>
<summary>More logs</summary>

```
synapse.http.request_metrics - 222 - ERROR - sentinel - Trying to stop RequestMetrics in the sentinel context.
2025-09-23 14:43:19,712 - synapse.util.metrics - 212 - WARNING - sentinel - Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost
2025-09-23 14:43:19,713 - synapse.rest.client.sync - 851 - INFO - sentinel - Client has disconnected; not serializing response.
2025-09-23 14:43:19,713 - synapse.http.server - 825 - WARNING - sentinel - Not sending response to request <XForwardedForRequest at 0x7f23e8111ed0 method='POST' uri='/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=281963%2Fs929324_147053_10_2652457_147960_2013_25554_4709564_0_164_2&timeout=30000' clientproto='HTTP/1.1' site='8008'>, already disconnected.
2025-09-23 14:43:19,713 - synapse.access.http.8008 - 515 - INFO - sentinel - 92.40.194.87 - 8008 - {@me:wi11.co.uk} Processed request: 30.005sec/-8.041sec (0.001sec, 0.000sec) (0.000sec/0.002sec/2) 0B 200! "POST /_matrix/client/unstable/org.matrix.simplified_msc3575/
```

</details>

From the logs, we can see things relating to `typing.get_new_events` and `/_matrix/client/unstable/org.matrix.simplified_msc3575/sync`, which led me to try out Sliding Sync with the typing extension enabled and allowed me to reproduce the problem locally. Sliding Sync is a unique scenario, as it's the only place we use `gather_optional_coroutines(...)` -> `run_coroutine_in_background(...)` (introduced in #17884), so it's the only place that exhibits this behaviour.

### Testing strategy

1. Configure Synapse to enable [MSC4186](matrix-org/matrix-spec-proposals#4186): Simplified Sliding Sync, which is actually gated under the [MSC3575](matrix-org/matrix-spec-proposals#3575) flag:

   ```yaml
   experimental_features:
     msc3575_enabled: true
   ```

1. Start Synapse: `poetry run synapse_homeserver --config-path homeserver.yaml`
1. Make a Sliding Sync request with one of the extensions enabled:

   ```http
   POST http://localhost:8008/_matrix/client/unstable/org.matrix.simplified_msc3575/sync

   {
     "lists": {},
     "room_subscriptions": {
       "!FlgJYGQKAIvAscfBhq:my.synapse.linux.server": {
         "required_state": [],
         "timeline_limit": 1
       }
     },
     "extensions": {
       "typing": {
         "enabled": true
       }
     }
   }
   ```

1. Open your homeserver logs and notice warnings about `Starting ... from sentinel context: metrics will be lost`.
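To make the completed-vs-incomplete distinction in the quote above concrete, here is a heavily simplified, hypothetical sketch of a `run_in_background`-style helper (not Synapse's actual implementation) showing why the early-return shortcut matters when the deferred is already complete:

```python
from typing import Any, Callable

from twisted.internet.defer import Deferred

from synapse.logging.context import current_context, set_current_context


def _restore_context(result: Any, context: Any) -> Any:
    # Put the saved logcontext back before handing the result onwards.
    set_current_context(context)
    return result


def run_in_background_sketch(f: Callable[[], "Deferred[Any]"]) -> "Deferred[Any]":
    calling_context = current_context()
    d = f()

    if d.called and not d.paused:
        # Completed awaitable: f never yielded to the reactor, so it never
        # switched us to the sentinel context; the calling context is still
        # active, which is exactly what we want. Returning early here is not
        # just an optimization: without it, the callback below would fire
        # immediately and reset us to the wrong context.
        return d

    # Incomplete awaitable: f has yielded to the reactor and, following the
    # logcontext rules, left us in the sentinel context. Restore the calling
    # context for our caller, and arrange for the context that was active
    # here to be restored when the deferred eventually completes.
    ctx = current_context()
    set_current_context(calling_context)
    d.addBoth(_restore_context, ctx)
    return d
```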
The main change here is to add a helper function, `gather_optional_coroutines`, which works in a similar way to `yieldable_gather_results` but takes a set of coroutines rather than a function.
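As a usage sketch of how this enables computing the sliding sync extensions in parallel (the extension coroutines and names below are invented for illustration; only `gather_optional_coroutines` comes from this PR, assumed to live alongside `yieldable_gather_results` in `synapse.util.async_helpers`):

```python
from typing import List, Optional, Tuple

from synapse.util.async_helpers import gather_optional_coroutines


async def _compute_typing_extension() -> str:
    # Stand-in for the real typing-extension work.
    return "typing extension response"


async def _compute_receipts_extension() -> List[str]:
    # Stand-in for the real receipts-extension work.
    return ["receipt 1", "receipt 2"]


async def compute_enabled_extensions(
    typing_enabled: bool, receipts_enabled: bool
) -> Tuple[Optional[str], Optional[List[str]]]:
    # Disabled extensions are passed as None; gather_optional_coroutines runs
    # the remaining coroutines concurrently (as background tasks, with log
    # contexts handled for us) and returns None in the corresponding slots.
    typing_result, receipts_result = await gather_optional_coroutines(
        _compute_typing_extension() if typing_enabled else None,
        _compute_receipts_extension() if receipts_enabled else None,
    )
    return typing_result, receipts_result
```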