fix(scheduler): contain worker fatal errors by ClSlaid · Pull Request #420 · openinfer-project/openinfer

ClSlaid · 2026-06-17T18:26:15Z

Purpose

Related: #403

Draft PR for design feedback on worker-fatal containment and typed execution errors. This is intentionally opened early to discuss the failure model, not to claim the broader fail-safe story is complete.

Summary

add shared typed execution errors plus an engine health fatal latch
make Qwen3 worker panics report WorkerPanic once, then exit the rank worker
make Qwen3 scheduler mark the engine unhealthy on domain-fatal errors and reject future work explicitly
preserve recoverable step errors as request/step-local failures where the worker domain remains trustworthy
expose unhealthy engine state through frontend /health as HTTP 503
convert obvious worker hot-path GEMM launch failures from unchecked panic wrappers into Result propagation
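The first two items can be pictured with a minimal, std-only sketch. Only the names `ExecutionError`, `EngineHealth`, and `WorkerPanic` come from this PR; the other variants, the fields, and the `run_step_contained` helper are assumptions for illustration and may not match the actual code.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::{Arc, Mutex};

/// Hypothetical typed execution error. Only `WorkerPanic` is named in the PR;
/// `StepFailed` and all field layouts are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum ExecutionError {
    /// A rank worker panicked, so the worker domain is no longer trustworthy.
    WorkerPanic { rank: usize, message: String },
    /// A single step failed while the worker domain stayed usable.
    StepFailed { reason: String },
}

impl ExecutionError {
    /// True when the whole execution domain must be treated as failed.
    pub fn is_domain_fatal(&self) -> bool {
        matches!(self, ExecutionError::WorkerPanic { .. })
    }
}

/// Fatal latch: the first fatal error is stored and never cleared.
#[derive(Clone, Default)]
pub struct EngineHealth {
    fatal: Arc<Mutex<Option<ExecutionError>>>,
}

impl EngineHealth {
    /// Records `err` if no fatal error is latched yet.
    /// Returns true only for the call that set the latch.
    pub fn mark_fatal(&self, err: ExecutionError) -> bool {
        let mut slot = self.fatal.lock().unwrap_or_else(|e| e.into_inner());
        if slot.is_some() {
            return false;
        }
        *slot = Some(err);
        true
    }

    pub fn fatal_error(&self) -> Option<ExecutionError> {
        self.fatal.lock().unwrap_or_else(|e| e.into_inner()).clone()
    }

    pub fn is_healthy(&self) -> bool {
        self.fatal_error().is_none()
    }
}

/// Runs one worker step and converts a panic into `WorkerPanic`,
/// so the caller can report it once and then exit the worker loop.
pub fn run_step_contained<F>(rank: usize, step: F) -> Result<(), ExecutionError>
where
    F: FnOnce() -> Result<(), ExecutionError>,
{
    match panic::catch_unwind(AssertUnwindSafe(step)) {
        Ok(result) => result,
        Err(payload) => {
            // Panic payloads are usually `&str` or `String`.
            let message = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "non-string panic payload".to_string());
            Err(ExecutionError::WorkerPanic { rank, message })
        }
    }
}
```

In this sketch the latch keeps the first fatal cause and ignores later ones, which is one way to get "report once" behavior; whether the PR latches the cause or only a flag is not stated here.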

Follow-up issues called out in docs

replace state-machine anyhow::Result with layered typed model/runtime errors
design retry/reschedule after execution-domain failure, including TP vs DP boundaries and streamed-token semantics
adopt the shared ExecutionError / EngineHealth contract in other model schedulers
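As a rough illustration of the first follow-up item, layered typed errors could look something like the sketch below, with a runtime-layer error wrapped by a model-layer error and propagated with `?` rather than a panic. Every name here (`RuntimeError`, `ModelError`, `launch_gemm`, `run_layer`, the `simulated_code` parameter) is hypothetical and not taken from the repository.

```rust
use std::error::Error;
use std::fmt;

/// Runtime layer: failures from launching kernels (hypothetical).
#[derive(Debug, PartialEq)]
pub enum RuntimeError {
    GemmLaunch { m: usize, n: usize, k: usize, code: i32 },
}

/// Model layer: adds model context on top of runtime failures (hypothetical).
#[derive(Debug, PartialEq)]
pub enum ModelError {
    Runtime { layer: usize, source: RuntimeError },
    InvalidShape { expected: usize, actual: usize },
}

impl fmt::Display for RuntimeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RuntimeError::GemmLaunch { m, n, k, code } => {
                write!(f, "GEMM launch {m}x{n}x{k} failed with code {code}")
            }
        }
    }
}

impl fmt::Display for ModelError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ModelError::Runtime { layer, .. } => write!(f, "layer {layer} runtime failure"),
            ModelError::InvalidShape { expected, actual } => {
                write!(f, "expected input length {expected}, got {actual}")
            }
        }
    }
}

impl Error for RuntimeError {}

impl Error for ModelError {
    // Expose the lower layer so callers can walk the error chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            ModelError::Runtime { source, .. } => Some(source),
            ModelError::InvalidShape { .. } => None,
        }
    }
}

/// Stand-in for a kernel launch. A nonzero `simulated_code` plays the role
/// of a driver error that an unchecked wrapper would have panicked on.
fn launch_gemm(m: usize, n: usize, k: usize, simulated_code: i32) -> Result<(), RuntimeError> {
    if simulated_code != 0 {
        return Err(RuntimeError::GemmLaunch { m, n, k, code: simulated_code });
    }
    Ok(())
}

/// Model-layer step: validates the shape, then propagates launch failures.
pub fn run_layer(
    layer: usize,
    hidden: usize,
    input_len: usize,
    simulated_code: i32,
) -> Result<(), ModelError> {
    if input_len != hidden {
        return Err(ModelError::InvalidShape { expected: hidden, actual: input_len });
    }
    launch_gemm(1, hidden, hidden, simulated_code)
        .map_err(|source| ModelError::Runtime { layer, source })?;
    Ok(())
}
```

Compared with `anyhow::Result`, a caller can match on the variant to decide whether a failure is request-local or domain-fatal, which is the distinction this PR relies on.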

Validation

cargo check --release -p openinfer-engine --lib
cargo check --release -p openinfer-qwen3-4b --lib
cargo test --release -p openinfer-engine --lib
cargo test --release -p openinfer-vllm-frontend --lib
cargo test --release -p openinfer-qwen3-4b --lib scheduler -- --nocapture
cargo test --release -p openinfer-qwen3-4b --lib
cargo fmt --check
git diff --check

Surface worker-domain failures as typed execution errors instead of letting scheduler paths wedge on closed channels. Qwen3 rank workers now catch panics, report a fatal WorkerPanic once, and exit; the scheduler marks the shared engine health unhealthy, fails bound work, and rejects later submissions explicitly.

Keep recoverable step failures request-local where the worker domain remains trustworthy, and expose unhealthy engine state through frontend /health as HTTP 503. Also route obvious worker hot-path GEMM launch failures through Result propagation instead of unchecked panic wrappers.

This is intentionally containment, not full fail-safe recovery. Follow-up work should replace remaining state-machine anyhow::Result usage with layered typed errors and design retry/reschedule semantics for failed execution domains.
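The scheduler-side behavior described in this commit message can be sketched as follows. This is a simplified model under stated assumptions, not the PR's code: `Scheduler`, `StepError`, `SubmitError`, and `health_status_code` are hypothetical names, the health latch is reduced to a boolean, and the 200 status for a healthy engine is an assumption (the PR only states 503 for the unhealthy case).

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Simplified shared health flag; it is set once and never cleared.
#[derive(Clone, Default)]
pub struct EngineHealth {
    unhealthy: Arc<AtomicBool>,
}

impl EngineHealth {
    pub fn mark_unhealthy(&self) {
        self.unhealthy.store(true, Ordering::Release);
    }

    pub fn is_healthy(&self) -> bool {
        !self.unhealthy.load(Ordering::Acquire)
    }
}

/// Hypothetical classification of step failures.
#[derive(Debug, PartialEq)]
pub enum StepError {
    Recoverable(String),
    DomainFatal(String),
}

#[derive(Debug, PartialEq)]
pub enum SubmitError {
    EngineUnhealthy,
}

pub struct Scheduler {
    health: EngineHealth,
    bound: Vec<u64>, // ids of requests currently bound to the worker domain
}

impl Scheduler {
    pub fn new(health: EngineHealth) -> Self {
        Scheduler { health, bound: Vec::new() }
    }

    /// Rejects new work explicitly once the engine is unhealthy.
    pub fn submit(&mut self, request_id: u64) -> Result<(), SubmitError> {
        if !self.health.is_healthy() {
            return Err(SubmitError::EngineUnhealthy);
        }
        self.bound.push(request_id);
        Ok(())
    }

    /// Returns the ids of the requests failed by this error.
    pub fn on_step_error(&mut self, request_id: u64, err: &StepError) -> Vec<u64> {
        match err {
            // Request-local: only the offending request fails.
            StepError::Recoverable(_) => {
                self.bound.retain(|id| *id != request_id);
                vec![request_id]
            }
            // Domain-fatal: latch unhealthy and fail all bound work.
            StepError::DomainFatal(_) => {
                self.health.mark_unhealthy();
                std::mem::take(&mut self.bound)
            }
        }
    }
}

/// Status code a /health handler could return (200 is an assumption).
pub fn health_status_code(health: &EngineHealth) -> u16 {
    if health.is_healthy() { 200 } else { 503 }
}
```

The point of the explicit `SubmitError` is that callers get an immediate, typed rejection instead of waiting on a channel that no worker will ever service.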

ClSlaid force-pushed the fix/worker-fatal-containment branch from 48a51a4 to 0723f0b on June 17, 2026 18:29

n-WN mentioned this pull request Jul 1, 2026

fix(qwen3): size decode page indices for prefix-shared views #482

Merged

ClSlaid wants to merge 1 commit into openinfer-project:main from ClSlaid:fix/worker-fatal-containment
