[ore] Infallible `mz_ore::task::JoinHandle` #34180

bkirwi · 2025-11-17T21:21:28Z

https://github.com/MaterializeInc/database-issues/issues/9729 is an example of a recurring class of issue:

- When Materialize shuts down (for good or ill), the two Tokio runtimes are shut down.
- Shutting-down Tokio runtimes cancel tasks.
- Our code is generally unprepared for tasks to fail. We abort on panic outside of unit tests, and our Tokio wrappers don't allow both aborting and joining on the result of a task. The only case where joining on a task might return a join error in production is during this disordered shutdown procedure.
- `mz_ore` had a utility for this: `JoinHandle::wait_and_assert_finished`. This code takes advantage of the fact that the only reason for cancellation is runtime shutdown, and just hangs out and waits to be shut down as well. We've been adding this method to join points whack-a-mole style as new errors crop up.

This patch just makes the `wait_and_assert_finished` behaviour the default for our `JoinHandle` wrapper. This should remove a class of bugs and eliminate a bunch of sketchy error handling across the codebase.

Motivation

Tired of bugs like https://github.com/MaterializeInc/database-issues/issues/9729!

Tips for reviewer

One case where join errors are still meaningful: catching panics in async tests. In those cases we just unwrap the join handle to expose the underlying Tokio handle, which seems fine.

def- · 2025-11-18T22:13:29Z

Thanks for triggering nightly, please ignore the orchestratord failure, it's already fixed on main!

teskje

I like the simplification at the call sites and the reasoning seems sound to me. The only thing that's not great is that in non-abort-on-panic contexts you now need to remember to call into_tokio_handle or risk hanging forever when a task panics. As you say, we should have these contexts only in tests and hanging in a test is better than panicking in production.

src/adapter/src/catalog/apply.rs

src/ore/src/task.rs

bkirwi · 2025-11-19T15:27:49Z

> The only thing that's not great is that in non-abort-on-panic contexts you now need to remember to call into_tokio_handle or risk hanging forever when a task panics.

I don't think that's right - the implementation (taken from wait_and_assert_finished) is meant to re-raise panics if they appear.

    match err.try_into_panic() {
      Ok(panic) => std::panic::resume_unwind(panic),

I think this is probably better, since it avoids hiding errors, but let me know if that changes your opinion on it!

Tasks can fail in theory in three ways:
- Panic! However, we abort on panic in production. (And for tests, we percolate the panic upward.)
- Task is explicitly aborted. This is statically ruled out by the wrapper.
- Runtime is shutting down. This is not interesting - it means the process is shutting down - and there's no value to percolating errors in that case. Here, we just remain pending indefinitely until the runtime dies.
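To make the cases above concrete, here is a minimal, standard-library-only sketch of a join future that never surfaces an error. It is not the `mz_ore` implementation: `JoinFailure`, `InfallibleJoin`, `NoopWaker`, and `poll_once` are hypothetical names invented for illustration, and `JoinFailure` is a simplified stand-in for Tokio's join error. The explicit-abort case does not appear because the wrapper is assumed never to expose an abort method.

```rust
use std::any::Any;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical, simplified stand-in for a runtime's join error.
pub enum JoinFailure {
    /// The task panicked; this is the captured panic payload.
    Panicked(Box<dyn Any + Send + 'static>),
    /// The task was cancelled, assumed here to mean runtime shutdown.
    Cancelled,
}

/// Wraps a fallible join future so that awaiting it yields `T` directly.
pub struct InfallibleJoin<F> {
    // `None` once a cancellation has been observed.
    inner: Option<F>,
}

impl<F> InfallibleJoin<F> {
    pub fn new(inner: F) -> Self {
        InfallibleJoin { inner: Some(inner) }
    }
}

impl<T, F> Future for InfallibleJoin<F>
where
    F: Future<Output = Result<T, JoinFailure>> + Unpin,
{
    type Output = T;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<T> {
        let this = self.get_mut();
        let polled = match this.inner.as_mut() {
            Some(inner) => Pin::new(inner).poll(cx),
            // Cancellation was already seen: stay pending forever.
            None => return Poll::Pending,
        };
        match polled {
            Poll::Ready(Ok(value)) => Poll::Ready(value),
            // Re-raise the task's panic in the joining context.
            Poll::Ready(Err(JoinFailure::Panicked(payload))) => std::panic::resume_unwind(payload),
            Poll::Ready(Err(JoinFailure::Cancelled)) => {
                // Drop the finished future and never resolve.
                this.inner = None;
                Poll::Pending
            }
            Poll::Pending => Poll::Pending,
        }
    }
}

/// Waker that does nothing, used only to drive a single poll.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future exactly once with a waker that does nothing.
pub fn poll_once<F: Future + Unpin>(fut: &mut F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    Pin::new(fut).poll(&mut cx)
}
```

Wrapping `std::future::ready(Ok(7))` and calling `poll_once` yields the value directly. A cancelled inner future leaves the wrapper pending on every poll, and a panicked one unwinds out of `poll`. In a real runtime the pending state would end only when the runtime drops the joining task during shutdown.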

bkirwi · 2025-11-19T18:49:09Z

~~Oops, bad rebase - checking it out.~~ Rebase was fine - CI hit docker issues.

teskje · 2025-11-19T20:25:12Z

Ah, sorry I meant when a thread panics. Although I'm not sure this is even a concern. If a runtime thread panics, does that abort all the tasks that were running on it? Or will those tasks be picked up by another runtime thread?

Anyway, that seems like a niche enough concern to not be worth worrying about.

bkirwi · 2025-11-21T15:55:59Z

> If a runtime thread panics, does that abort all the tasks that were running on it? Or will those tasks be picked up by another runtime thread?

Not an expert, but I think the answer is:

- If the panic occurs while polling the task's future, the panic is captured and returned via the join handle, the future is dropped, and the runtime thread continues with the other tasks in the queue.
- Otherwise... I guess that's a Tokio bug? I'm not sure what happens to tasks in that case, but probably if Tokio is panicking internally we have bigger problems...

bkirwi · 2025-11-21T15:58:07Z

Also thank you for the review!

bkirwi force-pushed the perfect-handle branch 7 times, most recently from a8d3e24 to 917d47b · November 17, 2025 22:51

bkirwi changed the title ~~[ore] Perfect handle~~ [ore] Infallible mz_ore::task::JoinHandle Nov 18, 2025

bkirwi force-pushed the perfect-handle branch from 917d47b to b8dd190 · November 18, 2025 16:05

bkirwi marked this pull request as ready for review November 18, 2025 18:47

bkirwi requested review from a team and aljoscha as code owners November 18, 2025 18:47

teskje approved these changes Nov 19, 2025

bkirwi force-pushed the perfect-handle branch from b8dd190 to 2a6ac3a · November 19, 2025 15:28

bkirwi marked this pull request as draft November 19, 2025 18:49

bkirwi marked this pull request as ready for review November 19, 2025 19:06

bkirwi merged commit 0cb63d5 into MaterializeInc:main Nov 21, 2025
131 checks passed
