Skip to content

fix: prevent worker pool thrashing in cluster distribution layer#22

Merged
sylvesterdamgaard merged 1 commit into
mainfrom
fix-worker-pool-thrashing
Apr 30, 2026
Merged

fix: prevent worker pool thrashing in cluster distribution layer#22
sylvesterdamgaard merged 1 commit into
mainfrom
fix-worker-pool-thrashing

Conversation

@sylvesterdamgaard
Copy link
Copy Markdown
Contributor

Summary

Fixes worker pool thrashing where current_workers oscillated ±2 every few seconds despite a stable cluster-wide target. The autoscaler was killing and respawning workers in rapid succession due to sort-order instability in the distribution layer.

  • Root cause: distributeClusterTarget() sorted managers by live heartbeat worker counts ($currentCounts), which fluctuate tick-to-tick as workers spawn/die. Different sort orders produced different per-host assignments each cycle, even when the cluster-wide target was unchanged.
  • Fix: Added distribution hysteresis — cache previous per-workload host assignments and reuse them when the target sum, active manager set, and per-host capacity are all unchanged. Fresh computation only triggers when the target actually changes, a manager joins/leaves, or capacity constraints shift.
  • Tests: 6 new tests including a 30-cycle stability invariant that replays fluctuating heartbeat counts and asserts identical assignments across all cycles.

Closes #21

Test plan

  • 30-cycle stability invariant: stable target with fluctuating heartbeat counts produces identical assignments every cycle
  • Recomputes distribution when target changes (scale up/down)
  • Recomputes when a manager joins the cluster
  • Recomputes when a manager leaves the cluster
  • Recomputes when cached assignments exceed host capacity (other workloads consumed slots)
  • All 550 existing tests pass — no regressions
  • PHPStan: no new errors
  • Pint: formatted

🤖 Generated with Claude Code

distributeClusterTarget() sorted managers by live heartbeat worker counts,
causing different per-host assignments each cycle even when the cluster-wide
target was stable. Added distribution hysteresis: cache previous assignments
and reuse them when the target, manager set, and per-host capacity are
unchanged. Fresh computation only triggers on actual target changes, manager
joins/leaves, or capacity constraint shifts.

Closes #21

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sylvesterdamgaard sylvesterdamgaard merged commit fa330ee into main Apr 30, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

v3.6.0 worker pool thrashing: current_workers oscillates ±2 every few seconds despite stable target

1 participant