fix(scheduler): contain worker fatal errors #420

Draft
ClSlaid wants to merge 1 commit into openinfer-project:main from ClSlaid:fix/worker-fatal-containment
ClSlaid commented Jun 17, 2026

Purpose

Related: #403

Draft PR for design feedback on worker-fatal containment and typed execution errors. This is intentionally opened early to discuss the failure model, not to claim the broader fail-safe story is complete.

Summary

  • add shared typed execution errors plus an engine health fatal latch
  • make Qwen3 worker panics report WorkerPanic once, then exit the rank worker
  • make Qwen3 scheduler mark the engine unhealthy on domain-fatal errors and reject future work explicitly
  • preserve recoverable step errors as request/step-local failures where the worker domain remains trustworthy
  • expose unhealthy engine state through frontend /health as HTTP 503
  • convert obvious worker hot-path GEMM launch failures from unchecked panic wrappers into Result propagation
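
As a rough illustration of the first and fifth bullets, a typed error plus a one-way health latch might look like the sketch below. Only the names `ExecutionError`, `EngineHealth`, and `WorkerPanic` come from this PR; the `StepFailed` variant, the fields, and every method name are assumptions made for the example, not the PR's actual API.

```rust
use std::fmt;
use std::sync::OnceLock;

/// Hypothetical shape of a typed execution error.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExecutionError {
    /// Recoverable: only the current request/step fails (assumed variant).
    StepFailed { reason: String },
    /// Domain-fatal: the rank worker panicked and has exited.
    WorkerPanic { rank: usize, message: String },
}

impl ExecutionError {
    /// True when the worker domain can no longer be trusted.
    pub fn is_domain_fatal(&self) -> bool {
        matches!(self, ExecutionError::WorkerPanic { .. })
    }
}

impl fmt::Display for ExecutionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExecutionError::StepFailed { reason } => write!(f, "step failed: {reason}"),
            ExecutionError::WorkerPanic { rank, message } => {
                write!(f, "worker {rank} panicked: {message}")
            }
        }
    }
}

impl std::error::Error for ExecutionError {}

/// Fatal latch: once set, it never returns to healthy.
#[derive(Debug, Default)]
pub struct EngineHealth {
    fatal: OnceLock<ExecutionError>,
}

impl EngineHealth {
    /// Records a domain-fatal error. Returns true only for the call that
    /// actually trips the latch; recoverable errors and later fatal errors
    /// leave the stored state unchanged.
    pub fn report(&self, err: &ExecutionError) -> bool {
        err.is_domain_fatal() && self.fatal.set(err.clone()).is_ok()
    }

    /// The first fatal error, if any.
    pub fn fatal_error(&self) -> Option<&ExecutionError> {
        self.fatal.get()
    }

    /// Status code a `/health` handler could return (assumed mapping:
    /// 200 when healthy, 503 once the latch is set).
    pub fn http_status(&self) -> u16 {
        if self.fatal.get().is_some() { 503 } else { 200 }
    }
}
```

Using `OnceLock` here keeps the first fatal error as the recorded cause, which is one simple way to get "report once" semantics without a mutex.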

Follow-up issues called out in docs

  • replace state-machine anyhow::Result with layered typed model/runtime errors
  • design retry/reschedule after execution-domain failure, including TP vs DP boundaries and streamed-token semantics
  • adopt the shared ExecutionError / EngineHealth contract in other model schedulers

Validation

  • cargo check --release -p openinfer-engine --lib
  • cargo check --release -p openinfer-qwen3-4b --lib
  • cargo test --release -p openinfer-engine --lib
  • cargo test --release -p openinfer-vllm-frontend --lib
  • cargo test --release -p openinfer-qwen3-4b --lib scheduler -- --nocapture
  • cargo test --release -p openinfer-qwen3-4b --lib
  • cargo fmt --check
  • git diff --check

Surface worker-domain failures as typed execution errors instead of letting scheduler paths wedge on closed channels. Qwen3 rank workers now catch panics, report a fatal WorkerPanic once, and exit; the scheduler marks the shared engine health unhealthy, fails bound work, and rejects later submissions explicitly.
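
The catch, report once, then exit behavior can be sketched with `std::panic::catch_unwind` and a standard channel. This is a minimal sketch, not the PR's implementation: `WorkerEvent`, `spawn_rank_worker`, and `panic_message` are hypothetical names, and the real worker loop is assumed to be more involved.

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};
use std::sync::mpsc::Sender;
use std::thread;

/// Hypothetical messages a rank worker sends back to the scheduler.
#[derive(Debug, PartialEq, Eq)]
pub enum WorkerEvent {
    StepDone { rank: usize, step: u64 },
    WorkerPanic { rank: usize, message: String },
}

/// Extracts a readable message from a panic payload.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "non-string panic payload".to_string()
    }
}

/// Runs `run_step` for each step on its own thread. A panic is caught,
/// reported exactly once as `WorkerPanic`, and the worker then exits
/// instead of continuing in a possibly corrupted state.
pub fn spawn_rank_worker<F>(
    rank: usize,
    steps: u64,
    events: Sender<WorkerEvent>,
    mut run_step: F,
) -> thread::JoinHandle<()>
where
    F: FnMut(u64) + Send + 'static,
{
    thread::spawn(move || {
        for step in 0..steps {
            match panic::catch_unwind(AssertUnwindSafe(|| run_step(step))) {
                Ok(()) => {
                    // Ignore send errors: the scheduler may already be gone.
                    let _ = events.send(WorkerEvent::StepDone { rank, step });
                }
                Err(payload) => {
                    let message = panic_message(&*payload);
                    let _ = events.send(WorkerEvent::WorkerPanic { rank, message });
                    return; // exit the rank worker after the single report
                }
            }
        }
    })
}
```

Because the worker sends an explicit event before returning, the receiving side sees a typed failure rather than only a closed channel.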

Keep recoverable step failures request-local where the worker domain remains trustworthy, and expose unhealthy engine state through frontend /health as HTTP 503. Also route obvious worker hot-path GEMM launch failures through Result propagation instead of unchecked panic wrappers.
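
To illustrate the GEMM change in isolation, the sketch below replaces a panicking wrapper with one that returns a `Result` the caller propagates with `?`. All names here (`GemmLaunchError`, `raw_gemm_launch`, `gemm_checked`, `run_layer`) are hypothetical, and the raw launch is a stand-in that only mimics a status-code API; it is not the project's kernel binding.

```rust
/// Hypothetical error carrying the raw launch status code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GemmLaunchError {
    pub status: i32,
}

/// Stand-in for a raw kernel launch that returns 0 on success.
/// Assumption for the example: a zero-sized dimension is rejected.
fn raw_gemm_launch(m: usize, n: usize, k: usize) -> i32 {
    if m == 0 || n == 0 || k == 0 { 1 } else { 0 }
}

/// Checked wrapper: a failed launch becomes an `Err` instead of a panic.
pub fn gemm_checked(m: usize, n: usize, k: usize) -> Result<(), GemmLaunchError> {
    match raw_gemm_launch(m, n, k) {
        0 => Ok(()),
        status => Err(GemmLaunchError { status }),
    }
}

/// A hot-path caller propagates the failure with `?`, so the step can be
/// failed by whoever owns the request instead of unwinding the worker.
pub fn run_layer(tokens: usize, hidden: usize) -> Result<(), GemmLaunchError> {
    gemm_checked(tokens, hidden, hidden)?;
    gemm_checked(tokens, 4 * hidden, hidden)?;
    Ok(())
}
```

Whether a given launch failure is treated as request-local or domain-fatal is a separate decision made further up the call stack; the wrapper only makes the failure visible as a value.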

This is intentionally containment, not full fail-safe recovery. Follow-up work should replace remaining state-machine anyhow::Result usage with layered typed errors and design retry/reschedule semantics for failed execution domains.
