@dignifiedquire
Contributor

Description

An attempt at improving the speed and reliability of the shutdown in magicsock

  • set closing asap
  • ensure remote actors are shut down using cancellation tokens
  • shutdown actors before waiting for wait_idle
  • use compare_exchange to improve closing logic

Ref #3762

dignifiedquire marked this pull request as ready for review December 11, 2025 10:20
github-actions bot commented Dec 11, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3763/docs/iroh/

Last updated: 2025-12-12T23:16:02Z

n0bot bot added this to iroh Dec 11, 2025
github-project-automation bot moved this to 🏗 In progress in iroh Dec 11, 2025

let duration = start.elapsed();
println!(
"Received {} in {:.4}s ({}/s, time to first byte {}s, {} chunks)",
Member

How about moving this log above the close, right after drain_stream, and adding another log ("Closed cleanly" or similar) after the close timeout?
Then, when a run completes the transfer but fails only during shutdown, we can see that. Right now there is no visible difference between the two cases, because if the shutdown errors the whole fn errors.

.compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
.is_err()
{
return;
Member

This means that if you call Endpoint::close twice on a clone of the endpoint, only the first call will wait for quinn::Endpoint::wait_idle, whereas the second call will complete immediately. I think we should always await Endpoint::wait_idle, even if closing is already true.

Contributor Author

Fixed, though we never actually get here, because Endpoint::close short-circuits well before this.
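
The behaviour discussed in this thread (only one caller performs the teardown, but every caller still waits for idle) can be sketched with the standard library alone. This is a minimal illustration, not iroh's implementation: `Shutdown`, `teardown_runs`, and the `Condvar`-based `wait_idle` are hypothetical stand-ins for the endpoint state and for `quinn::Endpoint::wait_idle`.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

/// Hypothetical stand-in for the endpoint's shutdown state (not iroh's types).
pub struct Shutdown {
    closing: AtomicBool,
    teardown_runs: AtomicUsize,
    idle: Mutex<bool>,
    idle_changed: Condvar,
}

impl Shutdown {
    pub fn new() -> Self {
        Self {
            closing: AtomicBool::new(false),
            teardown_runs: AtomicUsize::new(0),
            idle: Mutex::new(false),
            idle_changed: Condvar::new(),
        }
    }

    /// Closes the endpoint. Returns `true` if this call performed the teardown.
    pub fn close(&self) -> bool {
        // Only the caller that flips `closing` from false to true wins.
        let first = self
            .closing
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok();
        if first {
            // Stand-in for the real teardown work: count it, mark idle, wake waiters.
            self.teardown_runs.fetch_add(1, Ordering::Relaxed);
            *self.idle.lock().unwrap() = true;
            self.idle_changed.notify_all();
        }
        // Every caller waits for idle, whether or not it won the exchange.
        self.wait_idle();
        first
    }

    /// Blocks until the idle flag is set (assumed analogue of `wait_idle`).
    fn wait_idle(&self) {
        let mut idle = self.idle.lock().unwrap();
        while !*idle {
            idle = self.idle_changed.wait(idle).unwrap();
        }
    }

    /// Number of times the teardown body actually ran.
    pub fn teardown_runs(&self) -> usize {
        self.teardown_runs.load(Ordering::Relaxed)
    }
}
```

Calling `close` from several threads on a shared `Arc<Shutdown>` runs the teardown once; the losing callers return only after the idle flag has been set, rather than returning early.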

}

// Cancel any running netreports
self.shutdown_token_netreport.cancel();
Member

We now have quite a few variables in play and I'm wondering if we can simplify it: I think we could remove self.msock.closing and instead use only two cancel tokens:
self.shutdown_token_close_start and self.shutdown_token_close_endpoint_closed
The former would be passed to netreport (i.e. replace shutdown_token_netreport and the closing AtomicBool), and the latter would be used for everything else.

(i.e. instead of having both a cancel token and an AtomicBool, just use cancel_token.is_cancelled() wherever the AtomicBool was used)

Contributor Author

Resolved by merging @Frando's PR.

Frando mentioned this pull request Dec 12, 2025
11 tasks
/// Packets queued to send to the client
send_queue: mpsc::Receiver<Packet>,
/// Important packets queued to send to the client
disco_send_queue: mpsc::Receiver<Packet>,
Contributor

I had no idea we were prioritising these! That was probably a good call :)

);
if res.is_err() {
- println!("[{remote}] Did not disconnect within 3 seconds");
+ println!("[{remote}] Error: Did not disconnect within 4 seconds");
Contributor

erm? uses 3s above?


#[tokio::test]
- #[traced_test]
+ // #[traced_test]
Contributor

I assume this was a debugging attempt that you forgot to clean up?

Contributor Author

yes 😅

let barrier = Arc::new(tokio::sync::Barrier::new(2));

// The server accepts the connections of the clients sequentially.
let s_b = barrier.clone();
Contributor

I'd add some comments as to why this is needed as otherwise the next reader will be puzzled.

Contributor Author

Yeah, I am not sure yet whether I will leave it in, but it makes debugging these tests a lot saner.


fn is_closing(&self) -> bool {
- self.closing.load(Ordering::Relaxed)
+ self.closing.load(Ordering::SeqCst)
Contributor

Someone smarter than me says that using SeqCst means you haven't really thought about what you're doing...

Contributor Author

gone 🎉
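
For readers weighing `Relaxed` against `SeqCst` on a flag like this one, a common middle ground is a `Release` store paired with an `Acquire` load. The sketch below is generic and assumes a flag that guards one other piece of data; `State`, `final_stats`, and `observe_after_close` are hypothetical names, and which ordering suits iroh's flag depends on what it guards, which this thread does not show.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical shared state: a closing flag plus data published before it.
pub struct State {
    closing: AtomicBool,
    final_stats: AtomicU32,
}

impl State {
    pub fn new() -> Self {
        Self {
            closing: AtomicBool::new(false),
            final_stats: AtomicU32::new(0),
        }
    }

    /// Publishes `stats`, then raises the flag.
    pub fn begin_close(&self, stats: u32) {
        self.final_stats.store(stats, Ordering::Relaxed);
        // Release: writes made before this store become visible to any thread
        // that observes `true` through an Acquire load.
        self.closing.store(true, Ordering::Release);
    }

    /// Acquire pairs with the Release store in `begin_close`.
    pub fn is_closing(&self) -> bool {
        self.closing.load(Ordering::Acquire)
    }

    pub fn final_stats(&self) -> u32 {
        self.final_stats.load(Ordering::Relaxed)
    }
}

/// Spawns a writer, waits until the flag is seen, and returns the stats read afterwards.
pub fn observe_after_close(stats: u32) -> u32 {
    let state = Arc::new(State::new());
    let writer = {
        let state = Arc::clone(&state);
        thread::spawn(move || state.begin_close(stats))
    };
    // Poll until the flag is observed; the Acquire load orders the read below.
    while !state.is_closing() {
        thread::yield_now();
    }
    let seen = state.final_stats();
    writer.join().unwrap();
    seen
}
```

If the flag is purely advisory and guards no other data, `Relaxed` is enough; `SeqCst` is only needed when a single total order across several atomics matters.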

Comment on lines +1002 to +975
if self.msock.is_closed() {
return;
}
Contributor

Maybe even move that before the trace?

}
- None => {}
+ None => {
+ debug!("report canceled");
Contributor

these should probably be trace since they're entirely normal behaviour?

dignifiedquire pushed a commit that referenced this pull request Dec 12, 2025
## Description

Based on #3763
Tries to simplify the shutdown state so that it is easier to reason about.

## Breaking Changes

<!-- Optional, if there are any breaking changes document them,
including how to migrate older code. -->

## Notes & open questions

<!-- Any notes, remarks or open questions you have to make about the PR.
-->

## Change checklist
<!-- Remove any that are not relevant. -->
- [ ] Self-review.
- [ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.
- [ ] List all breaking changes in the above "Breaking Changes" section.
- [ ] Open an issue or PR on any number0 repos that are affected by this
breaking change. Give guidance on how the updates should be handled or
do the actual updates themselves. The major ones are:
    - [ ] [`quic-rpc`](https://github.com/n0-computer/quic-rpc)
    - [ ] [`iroh-gossip`](https://github.com/n0-computer/iroh-gossip)
    - [ ] [`iroh-blobs`](https://github.com/n0-computer/iroh-blobs)
    - [ ] [`dumbpipe`](https://github.com/n0-computer/dumbpipe)
    - [ ] [`sendme`](https://github.com/n0-computer/sendme)
dignifiedquire and others added 18 commits December 12, 2025 23:17
- set closing asap
- ensure remote actors are shut down using cancellation tokens
- shutdown actors before waiting for `wait_idle`