
Conversation

@jstuczyn (Contributor) commented Sep 5, 2025

This PR moves all of the monorepo's shutdown watchers from the old TaskManager into the 'new' ShutdownManager, which internally relies on tokio's battle-tested CancellationToken and TaskTracker instead of our own TaskClient creation.

I have tried to incorporate as many spawned tasks as possible within our monorepo into the relevant ShutdownManager trackers, but that wasn't feasible in every case without some time-consuming refactoring. For example, adding that feature inside the GatewayClient would require too much refactoring, which I reckon is a bit pointless given it will be replaced soon enough. The same is true for any deeply nested tasks: exposing the relevant handles would have been too time-consuming.

Being a shutdown-only replacement for TaskManager, it manages shutdowns and nothing else, i.e. the broadcast SentStatus messages no longer work.

ShutdownManager furthermore allows a little more flexibility in which signals should trigger the shutdown: you can register custom shutdown signals with the with_shutdown(...) method, hopefully making it easier to incorporate into bigger, VPN-like systems.
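A rough sketch of what that could look like; the exact with_shutdown signature and the external_stop_rx channel here are assumptions for illustration, not the actual API:

// hypothetical usage: with_shutdown(...) registering an extra future as a shutdown trigger
let (external_stop_tx, external_stop_rx) = tokio::sync::oneshot::channel::<()>();

let shutdown_manager = ShutdownManager::build_new_default()?
    .with_shutdown(async move {
        // e.g. a stop request coming from an embedding VPN application
        let _ = external_stop_rx.await;
    });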

It also optionally (enabled by default) triggers a shutdown if a panic is detected, so hopefully we should no longer end up with a binary limping along because some internal task has failed without us realising. There is, however, one limitation: on panic we'll have to wait for the full shutdown timeout, as the task that has panicked will not stop gracefully (well, duh : )).
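For illustration, the general technique for panic-triggered shutdown looks roughly like this (a sketch of the idea using tokio's CancellationToken directly, not the actual ShutdownManager internals):

use std::panic;
use tokio_util::sync::CancellationToken;

fn install_panic_shutdown_hook(token: CancellationToken) {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // trigger a global shutdown on any panic...
        token.cancel();
        // ...then defer to the previous hook so the backtrace still gets printed
        previous(info);
    }));
}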

As for API changes, I tried to keep them as minimal as possible, but some breaking changes were inevitable; this includes using ShutdownTracker or ShutdownToken in place of TaskClient for any custom_shutdown builder method.

Also worth mentioning: unlike the old TaskClient, a dropped ShutdownToken will by default NOT cause a global shutdown. To get the old behaviour, use ShutdownDropGuard instead (which internally uses tokio's DropGuard).
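Since ShutdownDropGuard wraps tokio's DropGuard, the underlying semantics can be shown with tokio_util directly (a sketch of the behaviour, not the ShutdownDropGuard API itself):

use tokio_util::sync::CancellationToken;

let token = CancellationToken::new();

// dropping the guard cancels the token, mirroring the old TaskClient drop behaviour
let guard = token.clone().drop_guard();
drop(guard);
assert!(token.is_cancelled());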

Usage

You use it like you'd use the old TaskManager, e.g.:

let shutdown_manager = ShutdownManager::build_new_default()?;

let token = shutdown_manager.clone_shutdown_token();
tokio::spawn(async move { your_task.run(token).await });

shutdown_manager.run_until_shutdown().await;

HOWEVER, when possible, it's preferable to spawn tasks on ShutdownManager's tracker, so that we can wait for them during shutdown and ensure they stop gracefully:

let shutdown_manager = ShutdownManager::build_new_default()?;

let token = shutdown_manager.clone_shutdown_token();
shutdown_manager.spawn(async move { your_task.run(token).await });

shutdown_manager.run_until_shutdown().await;

Furthermore, if you don't care about your task getting interrupted during cancellation (i.e. it doesn't do any important work that has to complete atomically), you can let ShutdownManager manage the cancellation signals for you:

let shutdown_manager = ShutdownManager::build_new_default()?;

shutdown_manager.spawn_with_shutdown(async move { your_task.run().await });

shutdown_manager.run_until_shutdown().await;

And if you want to use it to its fullest potential, use try_spawn_named and try_spawn_named_with_shutdown to provide more debugging information, as in the example below.
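Mirroring how it's used elsewhere in this PR (the task name is purely for debugging output):

let shutdown_manager = ShutdownManager::build_new_default()?;

shutdown_manager
    .try_spawn_named_with_shutdown(async move { your_task.run().await }, "YourTask");

shutdown_manager.run_until_shutdown().await;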

A ShutdownToken can be created in two ways: either via .clone_shutdown_token() or via .child_shutdown_token(). The difference is quite subtle: a child token will get cancelled whenever its parent gets cancelled, but if the child gets cancelled, the parent is unaffected. In contrast, with a cloned ShutdownToken, cancelling either token causes both of them to get cancelled. If in doubt, use the clone variant to keep the same old behaviour.
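Since ShutdownToken is built on tokio's CancellationToken, the difference can be demonstrated with tokio_util directly (a sketch of the semantics, not the ShutdownToken API):

use tokio_util::sync::CancellationToken;

let parent = CancellationToken::new();
let child = parent.child_token();
let cloned = parent.clone();

// cancelling the child leaves the parent (and its clones) untouched
child.cancel();
assert!(child.is_cancelled());
assert!(!parent.is_cancelled());

// cancelling a clone cancels the original too: they share the same state
cloned.cancel();
assert!(parent.is_cancelled());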

For intermediate migration, ShutdownManager has support for creating a legacy TaskClient. You'd do it as follows:

let shutdown_manager = ShutdownManager::build_new_default()?
	.with_legacy_task_manager();

let task_client = shutdown_manager.subscribe_legacy("subtask");


@jstuczyn jstuczyn requested a review from neacsu September 5, 2025 12:33

@jstuczyn jstuczyn force-pushed the feature/cancellation-migration branch from ab73645 to 171de90 on September 5, 2025 14:53
@jstuczyn jstuczyn force-pushed the feature/cancellation-migration branch from 3763876 to fc5a078 on September 5, 2025 17:59
@jstuczyn jstuczyn force-pushed the feature/cancellation-migration branch from fc5a078 to 43e12eb on September 5, 2025 18:16
@jstuczyn jstuczyn force-pushed the feature/cancellation-migration branch from 43e12eb to 34b2334 on September 5, 2025 18:25
@jstuczyn jstuczyn force-pushed the feature/cancellation-migration branch from 34b2334 to 8fa58ae on September 8, 2025 08:19
Base automatically changed from feature/nym-api-cancellation to develop on September 8, 2025 08:45
@jstuczyn jstuczyn force-pushed the feature/cancellation-migration branch from 8fa58ae to cdfec82 on September 8, 2025 08:47
@jstuczyn (Contributor, Author)

common/client-core/src/client/cover_traffic_stream.rs line 249 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Should this not be wrapped into some sort of cancel_token.run_until_cancelled?

It is wrapped when it's spawned via

        shutdown_tracker
            .try_spawn_named_with_shutdown(async move { stream.run().await }, "CoverTrafficStream");

which internally uses run_until_cancelled.
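For reference, tokio_util's CancellationToken exposes run_until_cancelled, so the wrapping presumably boils down to something like this sketch (tracker and token standing in for the internal TaskTracker and CancellationToken):

// run the future to completion, or stop early once the token is cancelled
tracker.spawn(async move {
    let _ = token.run_until_cancelled(stream.run()).await;
});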

@jstuczyn (Contributor, Author)

common/client-core/src/client/received_buffer.rs line 517 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Maybe move this where second break; happens due to channel being closed?

yes, good idea (and now I see exactly what you meant in that zulip message 😅 )

@jstuczyn (Contributor, Author)

common/client-core/src/client/received_buffer.rs line 558 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Interesting that in this piece you don't wait for cancellation like in the instance above.

I probably just went into auto mode: if the previous iteration was using wait_with_delay, I added a wait for cancellation.

@jstuczyn (Contributor, Author)

common/client-core/src/client/real_messages_control/acknowledgement_control/sent_notification_listener.rs line 43 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Should this not use something like run_until_cancelled()?

Similarly to the cover traffic stream, shutdown is handled by the entity that spawns the task.

@jstuczyn (Contributor, Author)

common/client-core/src/client/topology_control/mod.rs line 158 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Need to respect cancellation?

as before, handled by the client spawning the task : )

@jstuczyn (Contributor, Author)

common/client-core/src/client/topology_control/mod.rs line 160 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Need to respect cancellation like it used to?

ibid.

@jstuczyn (Contributor, Author)

common/client-libs/gateway-client/src/client/mod.rs line 635 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

But... we used to receive those....

yeah : (
we'll need to get some replacement going ASAP. I think/hope it will be better due to proper separation of concerns

@jstuczyn (Contributor, Author)

common/node-tester-utils/src/receiver.rs line 98 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Should you not break in here?

we absolutely should! great catch

@jstuczyn (Contributor, Author)

common/socks5-client-core/src/socks/mixnet_responses.rs line 137 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

break here?

yep.

@jstuczyn (Contributor, Author)

common/verloc/src/measurements/listener.rs line 69 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Nit: probably you could omit async move {} as run_until_cancelled() returns future.

fixed

@jstuczyn (Contributor, Author)

gateway/src/node/mod.rs line 522 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Nit: you could probably cancel_token.cancelled_owned() to avoid creating another future.

fixed

@jstuczyn (Contributor, Author)

gateway/src/node/client_handling/embedded_clients/mod.rs line 70 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Missing ;?

Done.

@jstuczyn (Contributor, Author)

gateway/src/node/client_handling/websocket/connection_handler/authenticated.rs line 598 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Missing cancellation handling in the new loop

Cancellation is handled by the top-level FreshHandler::start_handling method, i.e.:

 pub(crate) async fn start_handling(self)
    where
        S: AsyncRead + AsyncWrite + Unpin + Send,
        R: CryptoRng + RngCore + Send,
    {
        let remote = self.peer_address;
        let shutdown = self.shutdown.clone();
        tokio::select! {
            _ = shutdown.cancelled() => {
                trace!("received cancellation")
            }
            _ = super::handle_connection(self) => {
                debug!("finished connection handler for {remote}")
            }
        }
    }

This is because otherwise we'd need a lot of cancellation matches all over the place, and I think this way is easier.

@jstuczyn (Contributor, Author)

nym-node/src/node/metrics/console_logger.rs line 128 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

missing run_until_cancelled of some sorts?

Similar case to the other simple tasks being managed by the client.

@jstuczyn jstuczyn force-pushed the feature/cancellation-migration branch from 8207777 to cbfa305 on September 10, 2025 12:27
@jstuczyn (Contributor, Author) left a comment

Responded to and fixed all the issues you pointed out. Thanks for checking it out!

Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @durch, @neacsu, and @pronebird)

@jstuczyn jstuczyn merged commit 0ee387d into develop Sep 10, 2025
20 of 23 checks passed
@jstuczyn jstuczyn deleted the feature/cancellation-migration branch September 10, 2025 12:56
@pronebird (Contributor) left a comment

Reviewable status: 112 of 125 files reviewed, 11 unresolved discussions


common/client-core/src/client/received_buffer.rs line 558 at r2 (raw file):

Previously, jstuczyn (Jędrzej Stuczyński) wrote…

I probably just went into auto mode: if the previous iteration was using wait_with_delay, I added a wait for cancellation.

I didn't bother to comment on all instances because there were too many. Hope those are fixed too.

@pronebird (Contributor) left a comment

Reviewable status: 112 of 125 files reviewed, 8 unresolved discussions


common/client-core/src/client/topology_control/mod.rs line 160 at r2 (raw file):

Previously, jstuczyn (Jędrzej Stuczyński) wrote…

ibid.

I don't know what that means, but I assume it's the same as the message before...

@pronebird (Contributor) left a comment

Reviewable status: 112 of 125 files reviewed, 3 unresolved discussions


gateway/src/node/client_handling/websocket/connection_handler/authenticated.rs line 598 at r2 (raw file):

Previously, jstuczyn (Jędrzej Stuczyński) wrote…

Cancellation is handled by the top-level FreshHandler::start_handling method, i.e.:

 pub(crate) async fn start_handling(self)
    where
        S: AsyncRead + AsyncWrite + Unpin + Send,
        R: CryptoRng + RngCore + Send,
    {
        let remote = self.peer_address;
        let shutdown = self.shutdown.clone();
        tokio::select! {
            _ = shutdown.cancelled() => {
                trace!("received cancellation")
            }
            _ = super::handle_connection(self) => {
                debug!("finished connection handler for {remote}")
            }
        }
    }

This is because otherwise we'd need a lot of cancellation matches all over the place, and I think this way is easier.

Ok fair. Thanks for sharing. A bit hard to trace the entire execution in a code review.

@pronebird (Contributor) left a comment

:lgtm:

Reviewable status: 112 of 125 files reviewed, 1 unresolved discussion

@pronebird (Contributor) left a comment

@pronebird reviewed 13 of 13 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion
