Speed up sliding sync by computing extensions in parallel #17884
Conversation
General idea looks good 👍 Surprised we need the extra code, though.
```python
async def gather_optional_coroutines(
    *coroutines: Unpack[Tuple[Optional[Coroutine[Any, Any, T1]], ...]],
) -> Tuple[Optional[T1], ...]:
```
I'm surprised this sort of thing doesn't already exist in the API? Like `Promise.all(...)` in JavaScript.

Is it because we can't use `asyncio` yet?

If it's just a difference in how `None` is handled, can we change them from optional coroutines to something like `Promise.resolve(None)`, i.e. wrap the `None` in a coroutine (however best to do that in Python), so we're always working with a coroutine?
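For reference, the pattern being suggested here looks roughly like the following in plain `asyncio` (a hypothetical sketch for illustration only; the helper names are made up, and Synapse itself runs on Twisted, so this is not what the PR does):

```python
import asyncio
from typing import Any, Coroutine, Optional, Tuple


async def _resolve_none() -> None:
    # The asyncio equivalent of Promise.resolve(None): completes immediately.
    return None


async def gather_with_none_placeholders(
    *coroutines: Optional[Coroutine[Any, Any, Any]],
) -> Tuple[Any, ...]:
    # Replace each missing coroutine with a trivially-completed one, so that
    # asyncio.gather always receives a uniform list of awaitables.
    wrapped = [c if c is not None else _resolve_none() for c in coroutines]
    return tuple(await asyncio.gather(*wrapped))


async def main() -> None:
    async def fetch_typing() -> str:
        return "typing"

    # The second slot is disabled, so it is passed as None and comes back as None.
    results = await gather_with_none_placeholders(fetch_typing(), None)
    print(results)  # ('typing', None)


asyncio.run(main())
```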
The reason we need to do this manually (rather than using built-ins) is that we need to correctly handle log contexts.

As a quick summary: whenever we "fork off" execution of some async code, we need to use `run_in_background` (or `run_as_background_process` if we're not going to wait on the result later), and when we "join" the forks again we need to use `make_deferred_yieldable`. This is basically what `gather_optional_coroutines` does: it spawns the coroutines as background tasks and then calls `make_deferred_yieldable` when waiting for them all to complete.
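To make the fork/join pattern concrete, here is a minimal sketch of how such a helper could be built from those primitives (an illustration based on the description above, not necessarily the exact implementation in this PR; it uses `run_coroutine_in_background`, which this PR introduces, and simplifies error handling):

```python
from typing import Any, Coroutine, List, Optional, Tuple, TypeVar

from twisted.internet import defer

from synapse.logging.context import (
    make_deferred_yieldable,
    run_coroutine_in_background,
)

T = TypeVar("T")


async def gather_optional_coroutines_sketch(
    *coroutines: Optional[Coroutine[Any, Any, T]],
) -> Tuple[Optional[T], ...]:
    # "Fork": start each non-None coroutine as a background task.
    # run_coroutine_in_background wraps the coroutine in a Deferred while
    # following the logcontext rules.
    indexed = [(i, c) for i, c in enumerate(coroutines) if c is not None]
    deferreds = [run_coroutine_in_background(c) for _, c in indexed]

    # "Join": wait for all of them. make_deferred_yieldable restores our log
    # context when the gathered Deferred completes, so metrics and log lines
    # are not attributed to the sentinel context.
    gathered = await make_deferred_yieldable(
        defer.gatherResults(deferreds, consumeErrors=True)
    )

    # Slot the results back into their original positions, leaving None for
    # the coroutines that were never supplied.
    results: List[Optional[T]] = [None] * len(coroutines)
    for (i, _), value in zip(indexed, gathered):
        results[i] = value
    return tuple(results)
```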
This context would be useful to have in the docstrings. Log contexts are mentioned there, but none of the consequences are explained: why we need to do this, or why we care.
Conversation continued in #18900 (comment)
(#18964) Regressed in #18900 (comment); see the conversation there for more context.

### How is this a regression?

> To give this an update with more hindsight: this logic *was* redundant with the early return, and it is safe to remove this complexity ✅
>
> It seems like this actually has to do with completed vs incomplete deferreds...
>
> To explain how things previously worked *without* the early-return shortcut:
>
> In the normal case of an **incomplete awaitable**, we store the `calling_context` and the `f` function is called and runs until it yields to the reactor. Because `f` follows the logcontext rules, it sets the `sentinel` logcontext. Then, in `run_in_background(...)`, we restore the `calling_context`, store the current `ctx` (which is `sentinel`), and return. When the deferred completes, we restore `ctx` (which is `sentinel`) before yielding to the reactor again (all good ✅).
>
> In the other case, where we see a **completed awaitable**, we store the `calling_context` and the `f` function is called and runs to completion (no logcontext change). *This is where the shortcut would kick in, but I'm going to continue explaining as if we had commented out the shortcut.* Then, in `run_in_background(...)`, we restore the `calling_context` and store the current `ctx` (which is the same as the `calling_context`). Because the deferred is already completed, our extra callback is called immediately and we restore `ctx` (which is the same as the `calling_context`). Since we never yield to the reactor, the `calling_context` is exactly what we want again (all good ✅).
>
> ---
>
> But this also means that our early-return shortcut is no longer just an optimization: it is *necessary* for correct behaviour in the **completed awaitable** case, as we want to return with the `calling_context` and not reset to the `sentinel` context. I've updated the comment in #18964 to explain the necessity, as it's currently only described as an optimization.
>
> But because we made the same change to `run_coroutine_in_background(...)`, which didn't have the same early-return shortcut, we regressed the correct behaviour ❌. This is being fixed in #18964.
>
> *-- @MadLittleMods, #18900 (comment)*

### How did we find this problem?

Spawning from @wrjlewis [seeing](https://matrix.to/#/!SGNQGPGUwtcPBUotTL:matrix.org/$h3TxxPVlqC6BTL07dbrsz6PmaUoZxLiXnSTEY-QYDtA?via=jki.re&via=matrix.org&via=element.io) `Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost` in the logs:

<details>
<summary>More logs</summary>

```
synapse.http.request_metrics - 222 - ERROR - sentinel - Trying to stop RequestMetrics in the sentinel context.
2025-09-23 14:43:19,712 - synapse.util.metrics - 212 - WARNING - sentinel - Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost
2025-09-23 14:43:19,713 - synapse.rest.client.sync - 851 - INFO - sentinel - Client has disconnected; not serializing response.
2025-09-23 14:43:19,713 - synapse.http.server - 825 - WARNING - sentinel - Not sending response to request <XForwardedForRequest at 0x7f23e8111ed0 method='POST' uri='/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=281963%2Fs929324_147053_10_2652457_147960_2013_25554_4709564_0_164_2&timeout=30000' clientproto='HTTP/1.1' site='8008'>, already disconnected.
2025-09-23 14:43:19,713 - synapse.access.http.8008 - 515 - INFO - sentinel - 92.40.194.87 - 8008 - {@me:wi11.co.uk} Processed request: 30.005sec/-8.041sec (0.001sec, 0.000sec) (0.000sec/0.002sec/2) 0B 200! "POST /_matrix/client/unstable/org.matrix.simplified_msc3575/
```

</details>

From the logs, we can see things relating to `typing.get_new_events` and `/_matrix/client/unstable/org.matrix.simplified_msc3575/sync`, which led me to try out Sliding Sync with the typing extension enabled and allowed me to reproduce the problem locally. Sliding Sync is a unique scenario, as it's the only place we use `gather_optional_coroutines(...)` -> `run_coroutine_in_background(...)` (introduced in #17884), so it's the only place that exhibits this behaviour.

### Testing strategy

1. Configure Synapse to enable [MSC4186](matrix-org/matrix-spec-proposals#4186): Simplified Sliding Sync, which is actually gated under the [MSC3575](matrix-org/matrix-spec-proposals#3575) flag:

   ```yaml
   experimental_features:
     msc3575_enabled: true
   ```

1. Start Synapse: `poetry run synapse_homeserver --config-path homeserver.yaml`
1. Make a Sliding Sync request with one of the extensions enabled:

   ```http
   POST http://localhost:8008/_matrix/client/unstable/org.matrix.simplified_msc3575/sync

   {
     "lists": {},
     "room_subscriptions": {
       "!FlgJYGQKAIvAscfBhq:my.synapse.linux.server": {
         "required_state": [],
         "timeline_limit": 1
       }
     },
     "extensions": {
       "typing": {
         "enabled": true
       }
     }
   }
   ```

1. Open your homeserver logs and notice warnings about `Starting ... from sentinel context: metrics will be lost`.
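To make the completed-vs-incomplete distinction in the quote above concrete, here is a heavily simplified, hypothetical sketch of a `run_in_background`-style helper (not Synapse's actual implementation) showing why the early-return shortcut matters when the deferred is already complete:

```python
from typing import Any, Callable

from twisted.internet.defer import Deferred

from synapse.logging.context import current_context, set_current_context


def _restore_context(result: Any, context: Any) -> Any:
    # Put the saved logcontext back before handing the result onwards.
    set_current_context(context)
    return result


def run_in_background_sketch(f: Callable[[], "Deferred[Any]"]) -> "Deferred[Any]":
    calling_context = current_context()
    d = f()

    if d.called and not d.paused:
        # Completed awaitable: f never yielded to the reactor, so it never
        # switched us to the sentinel context; the calling context is still
        # active, which is exactly what we want. Returning early here is not
        # just an optimization: without it, the callback below would fire
        # immediately and reset us to the wrong context.
        return d

    # Incomplete awaitable: f has yielded to the reactor and, following the
    # logcontext rules, left us in the sentinel context. Restore the calling
    # context for our caller, and arrange for the context that was active
    # here to be restored when the deferred eventually completes.
    ctx = current_context()
    set_current_context(calling_context)
    d.addBoth(_restore_context, ctx)
    return d
```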
The main change here is to add a helper function, `gather_optional_coroutines`, which works in a similar way to `yieldable_gather_results` but takes a set of coroutines rather than a function.
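As a usage sketch of how this enables computing the sliding sync extensions in parallel (the extension coroutines and names below are invented for illustration; only `gather_optional_coroutines` comes from this PR, assumed to live alongside `yieldable_gather_results` in `synapse.util.async_helpers`):

```python
from typing import List, Optional, Tuple

from synapse.util.async_helpers import gather_optional_coroutines


async def _compute_typing_extension() -> str:
    # Stand-in for the real typing-extension work.
    return "typing extension response"


async def _compute_receipts_extension() -> List[str]:
    # Stand-in for the real receipts-extension work.
    return ["receipt 1", "receipt 2"]


async def compute_enabled_extensions(
    typing_enabled: bool, receipts_enabled: bool
) -> Tuple[Optional[str], Optional[List[str]]]:
    # Disabled extensions are passed as None; gather_optional_coroutines runs
    # the remaining coroutines concurrently (as background tasks, with log
    # contexts handled for us) and returns None in the corresponding slots.
    typing_result, receipts_result = await gather_optional_coroutines(
        _compute_typing_extension() if typing_enabled else None,
        _compute_receipts_extension() if receipts_enabled else None,
    )
    return typing_result, receipts_result
```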