Race-condition on shutdown of current-thread runtime? #7056
Comments
The way you shut down the runtime is by running the destructor of the `Runtime`. Are you perhaps using […]?
There is no panic! Instead, the `AsyncFd` returns an `io::Error` […]
Not on that runtime. The linked […]
Sorry, I meant the error, not panic. But I still don't see how you are shutting down the runtime without first exiting from the `block_on` call.
Oh. I understand now. The problem is that the `AsyncFd` is created while a different runtime's context is active, so it gets registered with that runtime's I/O driver rather than with the runtime that polls it. You should wait with creating the `AsyncFd` until you are on the runtime that will poll it.
Oh! I didn't realise that is what is happening. Thanks for pointing that out :)
This is a bit bizarre. Perhaps there should be some kind of warning about that? If each runtime were to carry a unique identifier, this could potentially be detected?
We could probably add some logic to detect this when creating the error message, so that we could emit a better error for this case.
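For illustration, here is a minimal sketch (not the actual firezone code) of the situation diagnosed above: an `AsyncFd` created while one runtime's context is active stays registered with that runtime's I/O driver, even if a different runtime ends up polling it. The UDP socket stands in for the TUN device, and all names are made up for the example.

```rust
use std::net::UdpSocket;
use tokio::io::unix::AsyncFd;

fn main() -> std::io::Result<()> {
    // Stand-in for the application's main runtime.
    let main_rt = tokio::runtime::Runtime::new()?;

    // The fd is created while `main_rt`'s context is active, so it is
    // registered with `main_rt`'s I/O driver -- not with the runtime that
    // will later poll it.
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    socket.set_nonblocking(true)?;
    let fd = main_rt.block_on(async { AsyncFd::new(socket) })?;

    let worker = std::thread::spawn(move || {
        // A dedicated current-thread runtime, as described in the issue below.
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_io()
            .build()
            .unwrap();
        rt.block_on(async move {
            // Once `main_rt` is dropped, this poll can start failing with
            // "A Tokio 1.x context was found, but it is being shutdown."
            // even though `rt` itself is still alive.
            if let Err(e) = fd.readable().await {
                eprintln!("readable() failed: {e}");
            }
        });
    });

    drop(main_rt); // shut down the runtime the fd is registered with
    worker.join().unwrap();
    Ok(())
}
```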
Reading and writing to the TUN device within `connlib` happens in a separate thread. The task running within this thread is connected to the rest of `connlib` via channels. When the application shuts down, this thread also needs to exit. Currently, we attempt to detect this from within the task when these channels close.

It appears that there is a race condition here because we attempt to read from the TUN device before reading from the channels. We treat read & write errors on the TUN device as non-fatal, so we loop around and attempt to read from it again, causing an infinite loop and log spam.

To fix this, we swap the order in which we evaluate the two concurrent tasks: the first task to be polled is now the channel for outbound packets, and only if that one is empty do we attempt to read new packets from the TUN device. This is also better from a backpressure point of view: we should attempt to flush out our local buffers of already processed packets before taking on "new work".

As a defense-in-depth strategy, we also attempt to detect the particular error from the tokio runtime when it is being shut down and exit the task.

Resolves: #7601. Related: tokio-rs/tokio#7056.
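For illustration, a rough sketch of the polling order described above, using `tokio::select!` with the `biased;` option; the `Device` and `Packet` types, function names, and channel layout are placeholders rather than the actual `connlib` code.

```rust
use std::os::fd::{AsRawFd, RawFd};
use tokio::io::unix::AsyncFd;
use tokio::sync::mpsc;

// Placeholder types standing in for the real TUN device and packet buffers.
struct Device(RawFd);
impl AsRawFd for Device {
    fn as_raw_fd(&self) -> RawFd {
        self.0
    }
}
type Packet = Vec<u8>;

async fn tun_task(device: AsyncFd<Device>, mut outbound: mpsc::Receiver<Packet>) {
    loop {
        tokio::select! {
            // `biased;` polls the branches top to bottom: drain packets that
            // are already queued for the device before reading new ones.
            biased;

            maybe_packet = outbound.recv() => match maybe_packet {
                Some(_packet) => {
                    // Write `_packet` to the TUN device here.
                }
                // Channel closed: the application is shutting down, exit the task.
                None => return,
            },

            readiness = device.readable() => match readiness {
                Ok(mut guard) => {
                    // Read a packet from the TUN device and hand it to connlib here.
                    guard.clear_ready();
                }
                // Defense in depth: treat errors such as the runtime shutting
                // down as fatal for this task instead of looping forever.
                Err(_) => return,
            },
        }
    }
}
```

With `biased;`, the branches are polled in the order written, so already-processed outbound packets are flushed before new work is taken on.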
Is your feature request related to a problem? Please describe.

In my application, I am spawning threads that use tokio's current-thread runtime. Only a single task is spawned onto this runtime via `block_on`. The remaining application is connected with this task via channels. When the application is shut down, there appears to be a race condition between cleaning up the runtime and polling the task that is spawned onto the runtime. In particular, my task ends up spamming "A Tokio 1.x context was found, but it is being shutdown." from polling the `AsyncFd` within that task.

This is the task that is running inside the runtime: https://github.com/firezone/firezone/blob/90bac881945ebe4f91f812672da2633cfe3c1079/rust/tun/src/unix.rs#L39-L107

What I don't understand is: how can the runtime shut down while the task is still being polled? I would expect that either the runtime shuts down and de-allocates the task, or the runtime is still active and polls the task. Here is an example of where the thread is being spawned:

https://github.com/firezone/firezone/blob/90bac881945ebe4f91f812672da2633cfe3c1079/rust/connlib/clients/apple/src/tun.rs#L33-L53
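For illustration, a condensed sketch of the setup described above, with made-up names and channel types; it shows a dedicated thread driving a single task via `block_on` on a current-thread runtime, connected to the rest of the application through channels.

```rust
use tokio::sync::mpsc;

// Illustrative stand-in for the single task driven by `block_on`.
async fn run(mut commands: mpsc::Receiver<String>, events: mpsc::Sender<String>) {
    while let Some(cmd) = commands.recv().await {
        // ... do I/O, then report back to the rest of the application ...
        let _ = events.send(format!("handled {cmd}")).await;
    }
    // Channel closed: the application is shutting down.
}

fn spawn_worker(
    commands: mpsc::Receiver<String>,
    events: mpsc::Sender<String>,
) -> std::thread::JoinHandle<()> {
    std::thread::spawn(move || {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .expect("failed to build runtime");
        // Only a single task runs on this runtime, driven by `block_on`.
        rt.block_on(run(commands, events));
    })
}
```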
Describe the solution you'd like

I would like the task passed to `block_on` to never be polled if the runtime is being shut down.

Describe alternatives you've considered
The way I am solving this is by checking the message of the `io::Error` (see firezone/firezone#7605). That isn't very clean and could break if that message ever changes. It is additional API surface, but perhaps a function could be exposed that matches an `io::Error` against it being the "shut down" error? Or a function on `Handle` to detect whether the runtime is shutting down?
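For illustration, a sketch of the string-matching workaround described above; the helper name is hypothetical, and the matched text is the error message quoted earlier in this issue.

```rust
use std::io;

/// Hypothetical helper: returns true if the error looks like tokio's
/// "runtime is shutting down" error. Brittle by design, since the message
/// is not a stable API and may change between tokio releases.
fn is_runtime_shutdown_error(error: &io::Error) -> bool {
    error
        .to_string()
        .contains("A Tokio 1.x context was found, but it is being shutdown")
}
```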
Additional context

Add any other context or screenshots about the feature request here.