fix(scheduler): contain worker fatal errors #420
Draft
ClSlaid wants to merge 1 commit into
Surface worker-domain failures as typed execution errors instead of letting scheduler paths wedge on closed channels. Qwen3 rank workers now catch panics, report a fatal WorkerPanic once, and exit; the scheduler marks the shared engine health unhealthy, fails bound work, and rejects later submissions explicitly.

Keep recoverable step failures request-local where the worker domain remains trustworthy, and expose unhealthy engine state through frontend /health as HTTP 503. Also route obvious worker hot-path GEMM launch failures through Result propagation instead of unchecked panic wrappers.

This is intentionally containment, not full fail-safe recovery. Follow-up work should replace remaining state-machine anyhow::Result usage with layered typed errors and design retry/reschedule semantics for failed execution domains.
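The containment flow described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the PR's actual code: `WorkerEvent`, `run_rank_worker`, `drain_events`, `submit`, and the atomic-flag `EngineHealth` are hypothetical names and shapes chosen for the example, and the panic is simulated.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

/// Hypothetical event type sent from a rank worker to the scheduler.
#[derive(Debug, PartialEq)]
enum WorkerEvent {
    StepDone(u64),
    Fatal { rank: usize, message: String },
}

/// Hypothetical shared health flag (true means unhealthy).
#[derive(Clone, Default)]
struct EngineHealth(Arc<AtomicBool>);

impl EngineHealth {
    fn mark_unhealthy(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_healthy(&self) -> bool {
        !self.0.load(Ordering::SeqCst)
    }
}

/// Runs the worker loop inside `catch_unwind`, reports a fatal event at most
/// once if the loop panics, then returns so the worker thread exits.
fn run_rank_worker(rank: usize, steps: Vec<u64>, events: mpsc::Sender<WorkerEvent>) {
    let result = panic::catch_unwind(AssertUnwindSafe(|| {
        for step in steps {
            if step == 2 {
                panic!("simulated kernel launch failure");
            }
            let _ = events.send(WorkerEvent::StepDone(step));
        }
    }));
    if let Err(payload) = result {
        // Panic payloads are usually a &str or a String.
        let message = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        let _ = events.send(WorkerEvent::Fatal { rank, message });
    }
}

/// Scheduler side: consume events until the worker hangs up, marking the
/// engine unhealthy when a fatal event arrives.
fn drain_events(events: mpsc::Receiver<WorkerEvent>, health: &EngineHealth) -> Vec<WorkerEvent> {
    let mut seen = Vec::new();
    for event in events {
        if matches!(event, WorkerEvent::Fatal { .. }) {
            health.mark_unhealthy();
        }
        seen.push(event);
    }
    seen
}

/// Later submissions are rejected explicitly instead of blocking on a
/// channel whose worker is gone.
fn submit(health: &EngineHealth) -> Result<(), String> {
    if health.is_healthy() {
        Ok(())
    } else {
        Err("engine unhealthy: worker domain failed".to_string())
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let health = EngineHealth::default();
    let worker = thread::spawn(move || run_rank_worker(0, vec![1, 2, 3], tx));
    let seen = drain_events(rx, &health);
    worker.join().expect("worker contains its own panic");
    println!("events: {seen:?}");
    println!("submit after failure: {:?}", submit(&health));
}
```

The key design point in this sketch is that the worker thread itself never unwinds past its entry function, so the scheduler learns about the failure from a typed event rather than from a closed channel.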
48a51a4 to 0723f0b
Purpose
Related: #403
Draft PR for design feedback on worker-fatal containment and typed execution errors. This is intentionally opened early to discuss the failure model, not to claim the broader fail-safe story is complete.
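To make the failure model under discussion concrete, here is a small sketch of how a typed execution error could separate request-local failures from worker-fatal ones and drive the health endpoint. Only the names `ExecutionError`, `EngineHealth`, and `WorkerPanic` and the 503 status for an unhealthy engine come from this PR; the other variants, the helper functions, and the 200 status for a healthy engine are assumptions made for illustration.

```rust
use std::fmt;

/// Hypothetical layout of a typed execution error.
#[derive(Debug, Clone, PartialEq)]
enum ExecutionError {
    /// Recoverable: only the failing request is affected (assumed variant).
    StepFailed { request_id: u64, reason: String },
    /// Fatal: the worker domain can no longer be trusted.
    WorkerPanic { rank: usize, message: String },
    /// Returned for submissions made after a fatal error (assumed variant).
    EngineUnhealthy,
}

impl ExecutionError {
    fn is_fatal(&self) -> bool {
        matches!(self, ExecutionError::WorkerPanic { .. })
    }
}

impl fmt::Display for ExecutionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExecutionError::StepFailed { request_id, reason } => {
                write!(f, "request {request_id} failed: {reason}")
            }
            ExecutionError::WorkerPanic { rank, message } => {
                write!(f, "rank {rank} worker panicked: {message}")
            }
            ExecutionError::EngineUnhealthy => write!(f, "engine is unhealthy"),
        }
    }
}

impl std::error::Error for ExecutionError {}

#[derive(Debug, Clone, Copy, PartialEq)]
enum EngineHealth {
    Healthy,
    Unhealthy,
}

/// Fatal errors flip the engine to unhealthy; recoverable ones leave it as is.
fn next_health(current: EngineHealth, err: &ExecutionError) -> EngineHealth {
    if err.is_fatal() {
        EngineHealth::Unhealthy
    } else {
        current
    }
}

/// Status code a /health handler could return (200 is an assumption).
fn health_status_code(health: EngineHealth) -> u16 {
    match health {
        EngineHealth::Healthy => 200,
        EngineHealth::Unhealthy => 503,
    }
}

/// Submissions are rejected with a typed error once the engine is unhealthy.
fn submit(health: EngineHealth) -> Result<(), ExecutionError> {
    match health {
        EngineHealth::Healthy => Ok(()),
        EngineHealth::Unhealthy => Err(ExecutionError::EngineUnhealthy),
    }
}

fn main() {
    let mut health = EngineHealth::Healthy;

    let recoverable = ExecutionError::StepFailed { request_id: 7, reason: "bad input".into() };
    health = next_health(health, &recoverable);
    println!("{recoverable}; /health returns {}", health_status_code(health));

    let fatal = ExecutionError::WorkerPanic { rank: 0, message: "boom".into() };
    health = next_health(health, &fatal);
    println!("{fatal}; /health returns {}", health_status_code(health));
    println!("later submit: {:?}", submit(health));
}
```

Keeping the fatal/recoverable distinction on the error type itself means the scheduler does not need to inspect message strings to decide whether a failure should stay request-local.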
Summary
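- Catch Qwen3 rank-worker panics, report a fatal WorkerPanic once, then exit the rank worker
- Expose unhealthy engine state through frontend /health as HTTP 503
- Route obvious worker hot-path GEMM launch failures through Result propagation
Follow-up issues called out in docs
- Replace remaining state-machine anyhow::Result usage with layered typed model/runtime errors
- ExecutionError/EngineHealth contract in other model schedulers
Validation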
- cargo check --release -p openinfer-engine --lib
- cargo check --release -p openinfer-qwen3-4b --lib
- cargo test --release -p openinfer-engine --lib
- cargo test --release -p openinfer-vllm-frontend --lib
- cargo test --release -p openinfer-qwen3-4b --lib scheduler -- --nocapture
- cargo test --release -p openinfer-qwen3-4b --lib
- cargo fmt --check
- git diff --check