Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/18900.misc
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Better explain how we manage the logcontext in `run_in_background(...)` and `run_as_background_process(...)`.
22 changes: 22 additions & 0 deletions docs/log_contexts.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,28 @@ def do_request_handling():
logger.debug("phew")
```

### The `sentinel` context

The default logcontext is `synapse.logging.context.SENTINEL_CONTEXT`, which is an empty
sentinel value to represent the root logcontext. This is what is used when there is no
other logcontext set. The phrase "clear/reset the logcontext" means to set the current
logcontext to the `sentinel` logcontext.

No CPU/database usage metrics are recorded against the `sentinel` logcontext.

Ideally, nothing from the Synapse homeserver would be logged against the `sentinel`
logcontext as we want to know which server the logs came from. In practice, this is not
always the case yet especially outside of request handling.

Global things outside of Synapse (e.g. Twisted reactor code) should run in the
`sentinel` logcontext. It's only when it calls into application code that a logcontext
gets activated. This means the reactor should be started in the `sentinel` logcontext,
and any time an awaitable yields control back to the reactor, it should reset the
logcontext to be the `sentinel` logcontext. This is important to avoid leaking the
current logcontext to the reactor (which would then get picked up and associated with
the next thing the reactor does).


## Using logcontexts with awaitables

Awaitables break the linear flow of code so that there is no longer a single entry point
Expand Down
144 changes: 107 additions & 37 deletions synapse/logging/context.py
Original file line number Diff line number Diff line change
Expand Up @@ -227,7 +227,16 @@ class ContextRequest:


class _Sentinel:
"""Sentinel to represent the root context"""
"""
Sentinel to represent the root context

This should only be used for tasks outside of Synapse like when we yield control
back to the Twisted reactor (event loop) so we don't leak the current logging
context to other tasks that are scheduled next in the event loop.

Nothing from the Synapse homeserver should be logged with the sentinel context. i.e.
we should always know which server the logs are coming from.
Comment on lines +237 to +238
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is aspirational since we currently go against this in a few places. This is being addressed in #18870

"""

__slots__ = ["previous_context", "finished", "request", "tag"]

Expand Down Expand Up @@ -616,9 +625,17 @@ def filter(self, record: logging.LogRecord) -> Literal[True]:


class PreserveLoggingContext:
"""Context manager which replaces the logging context
"""
Context manager which replaces the logging context

The previous logging context is restored on exit.

The previous logging context is restored on exit."""
`make_deferred_yieldable` is pretty equivalent to using `with
PreserveLoggingContext():` (using the default sentinel context), i.e. it clears the
logcontext before awaiting (and so before execution passes back to the reactor) and
restores the old context once the awaitable completes (execution passes from the
reactor back to the code).
"""

__slots__ = ["_old_context", "_new_context"]

Expand Down Expand Up @@ -784,6 +801,14 @@ def run_in_background(
return from the function, and that the sentinel context is set once the
deferred returned by the function completes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_in_background (or run_as_background_process if we're not going to later wait on the result)

-- @erikjohnston, https://github.com/element-hq/synapse/pull/17884/files#r1822356809

Is this distinction actually true?

It seems like both can used to wait for a result. Both need to be used with make_deferred_yieldable(...).

run_as_background_process just adds some metrics around what's being run.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is:

  • run_in_background makes the function run with the context of the calling function. This then needs to be awaited on later within the same context to ensure tracing works correctly.
  • run_as_background_process starts the function with a new context, plus measures the resource usage

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the caller, they interact the same (as far as I can tell):

  • Both preserve the calling context
  • Both return a Deferred that does not follow the synapse logcontext rules. Which means they need to be used with make_deferred_yieldable to get access to the result.

The only difference is internal as you pointed out:

  • run_in_background makes the function run with the context of the calling function.
  • run_as_background_process starts the function with a new context, plus measures the resource usage

But I think this already answers the question in this thread that the off the cuff comment on the distinction wasn't accurate ⏩

To explain how the log contexts work here:
- When this function is called, the current context is stored ("original"), we kick
off the background task, and we restore that original context before returning
- When the background task finishes, we don't want to leak our context into the
reactor which would erroneously get attached to the next operation picked up by
the event loop. We add a callback to the deferred which will clear the logging
context after it finishes and yields control back to the reactor.

Useful for wrapping functions that return a deferred or coroutine, which you don't
yield or await on (for instance because you want to pass it to
deferred.gatherResults()).
Expand All @@ -795,8 +820,13 @@ def run_in_background(
`f` doesn't raise any deferred exceptions, otherwise a scary-looking
CRITICAL error about an unhandled error will be logged without much
indication about where it came from.

Returns:
Deferred which returns the result of func, or `None` if func raises.
Note that the returned Deferred does not follow the synapse logcontext
rules.
"""
current = current_context()
calling_context = current_context()
try:
res = f(*args, **kwargs)
except Exception:
Expand All @@ -806,6 +836,9 @@ def run_in_background(

# `res` may be a coroutine, `Deferred`, some other kind of awaitable, or a plain
# value. Convert it to a `Deferred`.
#
# Wrapping the value in a deferred has the side effect of executing the coroutine,
# if it is one. If it's already a deferred, then we can just use that.
d: "defer.Deferred[R]"
if isinstance(res, typing.Coroutine):
# Wrap the coroutine in a `Deferred`.
Expand All @@ -820,28 +853,32 @@ def run_in_background(
# `res` is a plain value. Wrap it in a `Deferred`.
d = defer.succeed(res)

# The deferred has already completed
if d.called and not d.paused:
# The function should have maintained the logcontext, so we can
# optimise out the messing about
return d

# The function may have reset the context before returning, so
# we need to restore it now.
ctx = set_current_context(current)

# The original context will be restored when the deferred
# completes, but there is nothing waiting for it, so it will
# get leaked into the reactor or some other function which
# wasn't expecting it. We therefore need to reset the context
# here.
# The function may have reset the context before returning, so we need to restore it
# now.
#
# Our goal is to have the caller logcontext unchanged after firing off the
# background task and returning.
set_current_context(calling_context)

# The original logcontext will be restored when the deferred completes, but
# there is nothing waiting for it, so it will get leaked into the reactor (which
# would then get picked up by the next thing the reactor does). We therefore
# need to reset the logcontext here (set the `sentinel` logcontext) before
# yielding control back to the reactor.
#
# (If this feels asymmetric, consider it this way: we are
# effectively forking a new thread of execution. We are
# probably currently within a ``with LoggingContext()`` block,
# which is supposed to have a single entry and exit point. But
# by spawning off another deferred, we are effectively
# adding a new exit point.)
d.addBoth(_set_context_cb, ctx)
d.addBoth(_set_context_cb, SENTINEL_CONTEXT)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_in_background has been updated to match the docstring (set the sentinel context when the deferred completes) and makes sense to how things should work in my mind.

Can you come up with a reason why we were previously doing the trying to restore some other context?

This was introduced in matrix-org/synapse#3170 without an explanation around this specific piece.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmhmhmhmhmhm, I wonder if there is some awfulness with deferred chaining where the execution doesn't actually go back to the reactor but back to some other context.

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have chained deferred tests that still pass from that PR 🤷

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To give this an update with more hindsight; this logic was redundant with the early return and it is safe to remove this complexity ✅

It seems like this actually has to do with completed vs incomplete deferreds...

To explain how things previously worked without the early-return shortcut:

With the normal case of incomplete awaitable, we store the calling_context and the f function is called and runs until it yields to the reactor. Because f follows the logcontext rules, it sets the sentinel logcontext. Then in run_in_background(...), we restore the calling_context, store the current ctx (which is sentinel) and return. When the deferred completes, we restore ctx (which is sentinel) before yielding to the reactor again (all good ✅)

With the other case where we see a completed awaitable, we store the calling_context and the f function is called and runs to completion (no logcontext change). This is where the shortcut would kick in but I'm going to continue explaining as if we commented out the shortcut. -- Then in run_in_background(...), we restore the calling_context, store the current ctx (which is same as the calling_context). Because the deferred is already completed, our extra callback is called immediately and we restore ctx (which is same as the calling_context). Since we never yield to the reactor, the calling_context is perfect as that's what we want again (all good ✅)


But this also means that our early-return shortcut is no longer just an optimization and is necessary to act correctly in the completed awaitable case as we want to return with the calling_context and not reset to the sentinel context. I've updated the comment in #18964 to explain the necessity as it's currently just described as an optimization.

But because we made the same change to run_coroutine_in_background(...) which didn't have the same early-return shortcut, we regressed the correct behavior ❌ . This is being fixed in #18964

return d


Expand All @@ -859,53 +896,86 @@ def run_coroutine_in_background(
coroutine directly rather than a function. We can do this because coroutines
do not run until called, and so calling an async function without awaiting
cannot change the log contexts.

This is an ergonomic helper so we can do this:
```python
run_coroutine_in_background(func1(arg1))
```
Rather than having to do this:
```python
run_in_background(lambda: func1(arg1))
```
"""
calling_context = current_context()

current = current_context()
# Wrap the coroutine in a deferred, which will have the side effect of executing the
# coroutine in the background.
d = defer.ensureDeferred(coroutine)

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: run_coroutine_in_background: Why do we have this specialized version of run_in_background? Is this really a worthwhile shortcut?

This was introduced in #17884

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it makes it a bit more ergonomic to use the gather_optional_coroutines function, as you can do:

await gather_optional_coroutines(
    func1(arg1), func2(arg2), func3(arg3),
)

which otherwise you'd need to do something like:

await gather_optional(
    lambda: func1(arg1), lambda: func2(arg2), lambda: func3(arg3)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, ergonomic helper 👍

Updated the docstring to explain this ✅

# The function may have reset the context before returning, so
# we need to restore it now.
ctx = set_current_context(current)

# The original context will be restored when the deferred
# completes, but there is nothing waiting for it, so it will
# get leaked into the reactor or some other function which
# wasn't expecting it. We therefore need to reset the context
# here.
# The function may have reset the context before returning, so we need to restore it
# now.
#
# Our goal is to have the caller logcontext unchanged after firing off the
# background task and returning.
set_current_context(calling_context)

# The original logcontext will be restored when the deferred completes, but
# there is nothing waiting for it, so it will get leaked into the reactor (which
# would then get picked up by the next thing the reactor does). We therefore
# need to reset the logcontext here (set the `sentinel` logcontext) before
# yielding control back to the reactor.
#
# (If this feels asymmetric, consider it this way: we are
# effectively forking a new thread of execution. We are
# probably currently within a ``with LoggingContext()`` block,
# which is supposed to have a single entry and exit point. But
# by spawning off another deferred, we are effectively
# adding a new exit point.)
d.addBoth(_set_context_cb, ctx)
d.addBoth(_set_context_cb, SENTINEL_CONTEXT)
return d


T = TypeVar("T")


def make_deferred_yieldable(deferred: "defer.Deferred[T]") -> "defer.Deferred[T]":
"""Given a deferred, make it follow the Synapse logcontext rules:

If the deferred has completed, essentially does nothing (just returns another
completed deferred with the result/failure).

If the deferred has not yet completed, resets the logcontext before
returning a deferred. Then, when the deferred completes, restores the
current logcontext before running callbacks/errbacks.

(This is more-or-less the opposite operation to run_in_background.)
"""
Given a deferred, make it follow the Synapse logcontext rules:

- If the deferred has completed, essentially does nothing (just returns another
completed deferred with the result/failure).
- If the deferred has not yet completed, resets the logcontext before returning a
incomplete deferred. Then, when the deferred completes, restores the current
logcontext before running callbacks/errbacks.

This means the resultant deferred can be awaited without leaking the current
logcontext to the reactor (which would then get erroneously picked up by the next
thing the reactor does), and also means that the logcontext is preserved when the
deferred completes.

(This is more-or-less the opposite operation to run_in_background in terms of how it
handles log contexts.)

Pretty much equivalent to using `with PreserveLoggingContext():`, i.e. it clears the
logcontext before awaiting (and so before execution passes back to the reactor) and
restores the old context once the awaitable completes (execution passes from the
reactor back to the code).
"""
# The deferred has already completed
if deferred.called and not deferred.paused:
# it looks like this deferred is ready to run any callbacks we give it
# immediately. We may as well optimise out the logcontext faffery.
return deferred

# ok, we can't be sure that a yield won't block, so let's reset the
# logcontext, and add a callback to the deferred to restore it.
# Our goal is to have the caller logcontext unchanged after they yield/await the
# returned deferred.
#
# When the caller yield/await's the returned deferred, it may yield
# control back to the reactor. To avoid leaking the current logcontext to the
# reactor (which would then get erroneously picked up by the next thing the reactor
# does) while the deferred runs in the reactor event loop, we reset the logcontext
# and add a callback to the deferred to restore it so the caller's logcontext is
# active when the deferred completes.
Comment on lines +973 to +978
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this more accurate than the previous explanation?

The "yield will block" scenario is coming from the situation described in the PR description of matrix-org/synapse#3170

So, it turns out that if you have a first Deferred D1, you can add a callback which returns another Deferred D2, and D2 must then complete before any further callbacks on D1 will execute (and later callbacks on D1 get the result of D2 rather than D2 itself).

So, D1 might have called=True (as in, it has started running its callbacks), but any new callbacks added to D1 won't get run until D2 completes - so if you yield D1 in an inlineCallbacks function, your yield will 'block'.

In conclusion: some of our assumptions in logcontext were invalid. We need to make sure that we don't optimise out the logcontext juggling when this situation happens. Fortunately, it is easy to detect by checking D1.paused.

I feel like this may be irrelevant to what I think is actually happening (as explained now) 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is related to the d.paused check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like that detail might not even matter compared to this other reason why we would set sentinel here.

That detail might still be good to also include here but I have no confidence even with this scenario explanation and feels like the "why" half is missing which is what I would want filled in.

prev_context = set_current_context(SENTINEL_CONTEXT)
deferred.addBoth(_set_context_cb, prev_context)
return deferred
Expand Down
17 changes: 17 additions & 0 deletions synapse/metrics/background_process_metrics.py
Original file line number Diff line number Diff line change
Expand Up @@ -228,6 +228,11 @@ def run_as_background_process(
clock.looping_call and friends (or for firing-and-forgetting in the middle of a
normal synapse async function).

Because the returned Deferred does not follow the synapse logcontext rules, awaiting
the result of this function will result in the log context being cleared (bad). In
order to properly await the result of this function and maintain the current log
context, use `make_deferred_yieldable`.

Args:
desc: a description for this background process type
server_name: The homeserver name that this background process is being run for
Expand Down Expand Up @@ -280,6 +285,18 @@ async def run() -> Optional[R]:
name=desc, **{SERVER_NAME_LABEL: server_name}
).dec()

# To explain how the log contexts work here:
# - When this function is called, the current context is stored (using
# `PreserveLoggingContext`), we kick off the background task, and we restore the
# original context before returning (also part of `PreserveLoggingContext`).
# - When the background task finishes, we don't want to leak our background context
# into the reactor which would erroneously get attached to the next operation
# picked up by the event loop. We use `PreserveLoggingContext` to set the
# `sentinel` context and means the new `BackgroundProcessLoggingContext` will
# remember the `sentinel` context as its previous context to return to when it
# exits and yields control back to the reactor.
#
# TODO: Why can't we simplify to using `return run_in_background(run)`?
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to explore this further but I haven't come up with an answer yet.

Currently, some tests fail when we try this -> #18900 (comment)

It probably directly relates to #18900 (comment)

I think this PR is a good step forward and I'd rather leave the note in until we either comment "why" or simplify.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its likely due to run_in_background starts the run() function with the calling logcontext, rather than the sentinel context it expects. So when BackgroundProcessLoggingContext exits it would set the context back to the calling context rather than the sentinel context, though it'd get reset back to sentinel context again due to run_in_background, I can imagine that entering the calling context is unexpected at that point

with PreserveLoggingContext():
# Note that we return a Deferred here so that it can be used in a
# looping_call and other places that expect a Deferred.
Expand Down
Loading