Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race-condition on shutdown of current-thread runtime? #7056

Open
thomaseizinger opened this issue Dec 30, 2024 · 8 comments
Open

Race-condition on shutdown of current-thread runtime? #7056

thomaseizinger opened this issue Dec 30, 2024 · 8 comments
Labels
A-tokio Area: The main tokio crate C-feature-request Category: A feature request. M-runtime Module: tokio/runtime

Comments

@thomaseizinger
Copy link

Is your feature request related to a problem? Please describe.

In my application, I am spawning threads that use tokio's current-thread runtime. Only a single task is spawned onto this runtime via block_on. The remaining application is connected with this task via channels. When the application is shut down, there appears to be a race condition between cleaning up the runtime and polling the task that is spawned onto the runtime. In particular, my task ends up spamming "A Tokio 1.x context was found, but it is being shutdown." from polling the AsyncFd within that task.

This is the task that is running inside the runtime: https://github.com/firezone/firezone/blob/90bac881945ebe4f91f812672da2633cfe3c1079/rust/tun/src/unix.rs#L39-L107

What I don't understand is: How can the runtime shut down while the task is still being polled? I would expect that either the runtime shuts down and de-allocates the task or the runtime is still active and polls the task. Here is an example of where the thread is being spawned:

https://github.com/firezone/firezone/blob/90bac881945ebe4f91f812672da2633cfe3c1079/rust/connlib/clients/apple/src/tun.rs#L33-L53

Describe the solution you'd like

  • Runtime shutdown to be atomic, i.e. a task running within block_on to never be polled if the runtime is being shut down
  • A way of detecting runtime shutdown from within the task

Describe alternatives you've considered

The way I am solving this is by checking for the message of the io::Error (see firezone/firezone#7605). That isn't very clean and could break if that message every changes. It is additional API surface but perhaps a function could be exposed that matches an io::Error against it being the "shut down"-error? Or a function on Handle to detect whether the runtime is shutting down?

Additional context
Add any other context or screenshots about the feature request here.

@thomaseizinger thomaseizinger added A-tokio Area: The main tokio crate C-feature-request Category: A feature request. labels Dec 30, 2024
@thomaseizinger thomaseizinger changed the title Race-condition on shutdown of current-thread runtime Race-condition on shutdown of current-thread runtime? Dec 30, 2024
@Darksonn Darksonn added the M-runtime Module: tokio/runtime label Dec 30, 2024
@Darksonn
Copy link
Contributor

The way you shut down the runtime is by running the destructor of the Runtime. That cannot possibly happen with the send_recv_tun function since you're inside a call to block_on on the Runtime object. Are you sure that send_recv_tun is the function where this panic happens?

Are you perhaps using tokio::runtime::Handle::block_on anywhere?

@thomaseizinger
Copy link
Author

Are you sure that send_recv_tun is the function where this panic happens?

There is no panic! Instead, the AsyncFd returns an io::Error with that error text.

@thomaseizinger
Copy link
Author

thomaseizinger commented Dec 30, 2024

Are you perhaps using tokio::runtime::Handle::block_on anywhere?

Not on that runtime. The linked async fn is the only thing that runs on that runtime.

@Darksonn
Copy link
Contributor

Sorry, I meant the error not panic. But I still don't see how you are shutting down the runtime without first exiting from the block_on call. Where do you trigger shutdown?

@Darksonn
Copy link
Contributor

Oh. I understand now.

The problem is that the AsyncFd is using the IO driver of the runtime it was created in. It stops working when that other runtime shuts down. But the future is executing under a different runtime that did not get shut down.

You should wait with creating the AsyncFd until you're inside the new runtime.

@thomaseizinger
Copy link
Author

thomaseizinger commented Dec 30, 2024

Oh. I understand now.

The problem is that the AsyncFd is using the IO driver of the runtime it was created in. It stops working when that other runtime shuts down. But the future is executing under a different runtime that did not get shut down.

You should wait with creating the AsyncFd until you're inside the new runtime.

Oh! I didn't realise that is what is happening. Thanks for pointing that out :)

@thomaseizinger
Copy link
Author

This is a bit bizarre. Perhaps there should be some kind of warning about that? If each runtime where to carry a unique identifier, this could potentially be detected?

@Darksonn
Copy link
Contributor

We could probably add some logic to detect this when creating the error message, so that we could emit a better error for this case.

github-merge-queue bot pushed a commit to firezone/firezone that referenced this issue Jan 5, 2025
Reading and writing to the TUN device within `connlib` happens in a
separate thread. The task running within these threads is connected to
the rest of `connlib` via channels. When the application shuts down,
these threads also need to exit. Currently, we attempt to detect this
from within the task when these channels close. It appears that there is
a race condition here because we first attempt to read from the TUN
device before reading from the channels. We treat read & write errors on
the TUN device as non-fatal so we loop around and attempt to read from
it again, causing an infinite-loop and log spam.

To fix this, we swap the order in which we evaluate the two concurrent
tasks: The first task to be polled is now the channel for outbound
packets and only if that one is empty, we attempt to read new packets
from the TUN device. This is also better from a backpressure point of
view: We should attempt to flush out our local buffers of already
processed packets before taking on "new work".

As a defense-in-depth strategy, we also attempt to detect the particular
error from the tokio runtime when it is being shut down and exit the
task.

Resolves: #7601.
Related: tokio-rs/tokio#7056.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-feature-request Category: A feature request. M-runtime Module: tokio/runtime
Projects
None yet
Development

No branches or pull requests

2 participants