Skip to content

Conversation

@bkirwi
Copy link
Contributor

@bkirwi bkirwi commented Nov 17, 2025

https://github.com/MaterializeInc/database-issues/issues/9729 is an example of a recurring class of issue:

  • When Materialize shuts down (for good or ill), the two Tokio runtimes are shut down.
  • Shutting-down Tokio runtimes cancel tasks.
  • Our code is generally unprepared for tasks to fail. We abort on panic outside of unit tests; and our Tokio wrappers don't allow both aborting and joining on the result of a task. The only case where joining on a task might return a join-error in production is during this disordered shutdown procedure.
  • mz_ore had a utility for this: JoinHandle::wait_and_assert_finished. This code takes advantage of the fact that the only reason for cancellation is runtime shutdown, and just hang out and wait for itself to be shut down as well. We've been adding this method to join points whack-a-mole as new errors crop up.

This patch just makes the wait_and_assert_finished behaviour the overall default for our JoinHandle wrapper. This should remove a class of bug and eliminates a bunch of sketch error handling across the codebase.

Motivation

Tired of bugs like https://github.com/MaterializeInc/database-issues/issues/9729!

Tips for reviewer

One case where join errors are still meaningful: catching panics in async tests. In those cases we just unwrap the join handle to expose underlying Tokio handle, which seems fine.

@bkirwi bkirwi force-pushed the perfect-handle branch 7 times, most recently from a8d3e24 to 917d47b Compare November 17, 2025 22:51
@bkirwi bkirwi changed the title [ore] Perfect handle [ore] Infallible mz_ore::task::JoinHandle Nov 18, 2025
@bkirwi bkirwi marked this pull request as ready for review November 18, 2025 18:47
@bkirwi bkirwi requested review from a team and aljoscha as code owners November 18, 2025 18:47
@def-
Copy link
Contributor

def- commented Nov 18, 2025

Thansk for triggering nightly, please ignore the orchestratord failure, it's already fixed on main!

Copy link
Contributor

@teskje teskje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplification at the call sites and the reasoning seems sound to me. The only thing that's not great is that in non-abort-on-panic contexts you now need to remember to call into_tokio_handle or risk hanging forever when a task panics. As you say, we should have these contexts only in tests and hanging in a test is better than panicking in production.

@bkirwi
Copy link
Contributor Author

bkirwi commented Nov 19, 2025

The only thing that's not great is that in non-abort-on-panic contexts you now need to remember to call into_tokio_handle or risk hanging forever when a task panics.

I don't think that's right - the implementation (taken from wait_and_assert_finished) is meant to re-raise panics if they appear.

    match err.try_into_panic() {
      Ok(panic) => std::panic::resume_unwind(panic),

I think this is probably better, since it avoids hiding errors, but let me know if that changes your opinion on it!

Tasks can fail in theory for two ways:
- Panic! However, we disable panics in production.
  (And for tests, we percolate the panic upward.)
- Task is explicitly aborted. This is statically ruled out by the
  wrapper.
- Runtime is shutting down. This is not interesting - it means the
  process is shutting down - and there's no value to percolating errors
  in that case. Here, we just remain pending indefinitely until the
  runtime dies.
@bkirwi
Copy link
Contributor Author

bkirwi commented Nov 19, 2025

Oops, bad rebase - checking it out. Rebase was fine - CI hit docker issues.

@bkirwi bkirwi marked this pull request as draft November 19, 2025 18:49
@bkirwi bkirwi marked this pull request as ready for review November 19, 2025 19:06
@teskje
Copy link
Contributor

teskje commented Nov 19, 2025

Ah, sorry I meant when a thread panics. Although I'm not sure this is even a concern. If a runtime thread panics, does that abort all the tasks that were running on it? Or will those tasks be picked up by another runtime thread?

Anyway, that seems like a niche enough concern to not be worth worrying about.

@bkirwi bkirwi merged commit 0cb63d5 into MaterializeInc:main Nov 21, 2025
131 checks passed
@bkirwi
Copy link
Contributor Author

bkirwi commented Nov 21, 2025

If a runtime thread panics, does that abort all the tasks that were running on it? Or will those tasks be picked up by another runtime thread?

Not an expert, but I think the answer is:

  • If the panic occurs while polling the task's future, the panic is captured and returned via the join handle, the future is dropped, and the runtime thread continues with the other tasks in the queue.
  • Otherwise... I guess that's a Tokio bug? I'm not sure what happens to tasks in that case, but probably if Tokio is panicking internally we have bigger problems...

@bkirwi
Copy link
Contributor Author

bkirwi commented Nov 21, 2025

Also thank you for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants