-
Notifications
You must be signed in to change notification settings - Fork 327
fix(iroh): improve shutdown logic in the magicsock #3763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feat-multipath
Are you sure you want to change the base?
Conversation
|
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3763/docs/iroh/ Last updated: 2025-12-12T23:16:02Z |
|
|
||
| let duration = start.elapsed(); | ||
| println!( | ||
| "Received {} in {:.4}s ({}/s, time to first byte {}s, {} chunks)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about we move this log above the close right after drain_stream, and add another log "Closed cleanly" or such after the close timeout?
Then when we run it and it completes the transfer but only fails during shutdown we can see that, and not like now where there's no visible difference between the cases (because if the shutdown errors the whole fn errors)
iroh/src/magicsock.rs
Outdated
| .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed) | ||
| .is_err() | ||
| { | ||
| return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that if you call Endpoint::close twice on a clone of the endpoint, only the first call will wait for quinn::Endpoint::wait_idle, whereas the second call will complete immediately. I think we should always await Endpoint::wait_idle, even if closing is already true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, though we never actually come here, because Endpoint::close shortcircuits way before this
iroh/src/magicsock.rs
Outdated
| } | ||
|
|
||
| // Cancel any running netreports | ||
| self.shutdown_token_netreport.cancel(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now have quite a few variables in play and I'm wondering if we can simplify it: I think we could remove self.msock.closing and instead use only two cancel tokens:
self.shutdown_token_close_start and self.shutdown_token_close_endpoint_closed
The former would be passed to netreport (i.e. replace shutdown_token_netreport and the closing AtomicBool), and the latter would be used for everything else.
(i.e. instead of having both a cancel token and an atomicbool, just use cancel_token.is_cancelled() for when the atomicbool was used)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved by merging @FRANDOS PR
| /// Packets queued to send to the client | ||
| send_queue: mpsc::Receiver<Packet>, | ||
| /// Important packets queued to send to the client | ||
| disco_send_queue: mpsc::Receiver<Packet>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had no idea we were prioritising these! That was probably a good call :)
| ); | ||
| if res.is_err() { | ||
| println!("[{remote}] Did not disconnect within 3 seconds"); | ||
| println!("[{remote}] Error: Did not disconnect within 4 seconds"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
erm? uses 3s above?
iroh/src/endpoint.rs
Outdated
|
|
||
| #[tokio::test] | ||
| #[traced_test] | ||
| // #[traced_test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i assume this was debugging attempts that you forgot to clean up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes 😅
| let barrier = Arc::new(tokio::sync::Barrier::new(2)); | ||
|
|
||
| // The server accepts the connections of the clients sequentially. | ||
| let s_b = barrier.clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add some comments as to why this is needed as otherwise the next reader will be puzzled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I am not sure yet I will leave it in, but it makes it a lot saner to debug these tests
iroh/src/magicsock.rs
Outdated
|
|
||
| fn is_closing(&self) -> bool { | ||
| self.closing.load(Ordering::Relaxed) | ||
| self.closing.load(Ordering::SeqCst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone smarter than me says that using SeqCst means you haven't really thought about what you're doing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gone 🎉
| if self.msock.is_closed() { | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe even move that before the trace?
iroh/src/net_report.rs
Outdated
| } | ||
| None => {} | ||
| None => { | ||
| debug!("report canceled"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should probably be trace since they're entirely normal behaviour?
## Description Based on #3763 Tries to simplify shutdown state to be simpler to reason about. ## Breaking Changes <!-- Optional, if there are any breaking changes document them, including how to migrate older code. --> ## Notes & open questions <!-- Any notes, remarks or open questions you have to make about the PR. --> ## Change checklist <!-- Remove any that are not relevant. --> - [ ] Self-review. - [ ] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [ ] Tests if relevant. - [ ] All breaking changes documented. - [ ] List all breaking changes in the above "Breaking Changes" section. - [ ] Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are: - [ ] [`quic-rpc`](https://github.com/n0-computer/quic-rpc) - [ ] [`iroh-gossip`](https://github.com/n0-computer/iroh-gossip) - [ ] [`iroh-blobs`](https://github.com/n0-computer/iroh-blobs) - [ ] [`dumbpipe`](https://github.com/n0-computer/dumbpipe) - [ ] [`sendme`](https://github.com/n0-computer/sendme)
- set closing asap - ensure remote actors are shut down using cancellation tokens - shutdown actors before waiting for `wait_idle`
## Description Based on #3763 Tries to simplify shutdown state to be simpler to reason about. ## Breaking Changes <!-- Optional, if there are any breaking changes document them, including how to migrate older code. --> ## Notes & open questions <!-- Any notes, remarks or open questions you have to make about the PR. --> ## Change checklist <!-- Remove any that are not relevant. --> - [ ] Self-review. - [ ] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [ ] Tests if relevant. - [ ] All breaking changes documented. - [ ] List all breaking changes in the above "Breaking Changes" section. - [ ] Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are: - [ ] [`quic-rpc`](https://github.com/n0-computer/quic-rpc) - [ ] [`iroh-gossip`](https://github.com/n0-computer/iroh-gossip) - [ ] [`iroh-blobs`](https://github.com/n0-computer/iroh-blobs) - [ ] [`dumbpipe`](https://github.com/n0-computer/dumbpipe) - [ ] [`sendme`](https://github.com/n0-computer/sendme)
e5a46d9 to
4be5f94
Compare
Description
An attempt at improving the speed and reliability of the shutdown in magicsock
wait_idlecompare_exchangeto improveclosinglogicRef #3762