fix: correct CPU capacity defaults and cluster-wide scaling logic #13

Merged
sylvesterdamgaard merged 5 commits into main from analyze-autoscale-display on Apr 29, 2026

Conversation

@sylvesterdamgaard (Contributor) commented Apr 29, 2026

Summary

Fixes three root-cause bugs that caused autoscale to recommend scale-down at 100% utilization, with SLA breaches pending, in cluster mode.

1. CPU defaults too aggressive for containers

  • reserve_cpu_cores: 1 → 0.2 (type int → float throughout)
  • worker_cpu_core_estimate: 1.0 → 0.2
  • Usable cores guard: max(..., 1) → max(..., 0) to allow fractional cores
  • A 0.5-core container now correctly supports 1 worker instead of 0 (see the sketch below)
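
A minimal sketch of the corrected math, assuming hypothetical names (the real CapacityCalculator works from a metrics snapshot, not scalars):

```php
<?php

// Hypothetical sketch; not the package's actual signature.
function maxWorkersByCpu(float $totalCores, float $reserveCores, float $estimatePerWorker): int
{
    // Guard is now max(..., 0): fractional leftovers count, instead of
    // being rounded up to a phantom full core by the old max(..., 1).
    $usableCores = max($totalCores - $reserveCores, 0.0);

    return (int) floor($usableCores / $estimatePerWorker);
}

// New defaults (reserve 0.2, estimate 0.2) on a 0.5-core container:
// floor((0.5 - 0.2) / 0.2) = 1 worker.
var_dump(maxWorkersByCpu(0.5, 0.2, 0.2)); // int(1)

// Old defaults (reserve 1, estimate 1.0) left no headroom at all:
var_dump(maxWorkersByCpu(0.5, 1.0, 1.0)); // int(0)
```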

2. Cluster leader applied local-host capacity to cluster-wide targets

  • clusterTargetWorkers() called ScalingEngine::evaluate() which uses local CPU/memory capacity against cluster-wide pool workers — scope mismatch
  • Added ScalingEngine::evaluateDemand(): strategy + config bounds only, no system capacity constraint
  • distributeClusterTarget() now respects per-host maxWorkers, skipping hosts at capacity
  • Result: required_workers reflects actual demand, not capacity-capped values (see the sketch after this list)
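
A simplified sketch of the split, with assumed signatures (the real methods operate on strategy, config, and stats objects):

```php
<?php

// evaluateDemand(): strategy target clamped to config bounds only;
// deliberately no local CPU/memory capacity constraint (hypothetical shape).
function evaluateDemand(int $strategyTarget, int $minWorkers, int $maxWorkers): int
{
    return max($minWorkers, min($strategyTarget, $maxWorkers));
}

/**
 * Distribute a cluster-wide target across hosts, respecting per-host caps.
 *
 * @param array<string, int> $current    current workers per host
 * @param array<string, int> $maxPerHost per-host maxWorkers
 * @return array<string, int> new per-host targets
 */
function distributeClusterTarget(int $target, array $current, array $maxPerHost): array
{
    $assigned = $current;
    $remaining = max($target - array_sum($current), 0);

    foreach ($assigned as $host => $workers) {
        if ($remaining === 0) {
            break;
        }
        // Hosts already at capacity are skipped; others absorb the surplus.
        $grant = min(max($maxPerHost[$host] - $workers, 0), $remaining);
        $assigned[$host] += $grant;
        $remaining -= $grant;
    }

    return $assigned;
}

// Demand of 12 spread over a full small host and a larger host with headroom:
print_r(distributeClusterTarget(
    evaluateDemand(12, 1, 20),
    ['small' => 1, 'large' => 3],
    ['small' => 1, 'large' => 10],
));
// ['small' => 1, 'large' => 10]: small host skipped, large capped at its max
```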

3. clusterScaleSignal() ignored utilization and backlog

  • Previously only checked "do required workers fit on fewer hosts?"
  • Now blocks scale-down when utilization ≥ 80% or any workload has pending jobs (sketched below)
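
In pseudocode-level PHP (names, and the exact hold/scale_up handoff, are assumptions based on this description):

```php
<?php

// Sketch of the new guard in clusterScaleSignal(); utilization is a fraction.
function clusterScaleSignal(float $utilization, int $pendingJobs, bool $fitsOnFewerHosts): string
{
    // New guard: a hot or backlogged cluster never scales down,
    // even if the workers would bin-pack onto fewer hosts.
    if ($utilization >= 0.80 || $pendingJobs > 0) {
        return 'hold'; // scale_up, if warranted, is decided by the demand path
    }

    // Old behavior started (and ended) here: pure bin-packing.
    return $fitsOnFewerHosts ? 'scale_down' : 'hold';
}

echo clusterScaleSignal(1.00, 5, true); // hold (was scale_down before the fix)
echo clusterScaleSignal(0.79, 0, true); // scale_down (the 79% boundary case)
```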

Test plan

  • All 464 existing tests pass (1142 assertions)
  • PHPStan clean on changed files
  • Pint formatting passes
  • Deploy to staging cluster (2 hosts: 0.5-core + 2-core) and verify:
    • Small host gets workers assigned
    • Scale signal shows hold or scale_up under load, never scale_down
    • required_workers reflects actual demand, not capped values

sylvesterdamgaard and others added 5 commits April 30, 2026 00:38
Three root-cause fixes for autoscale misbehavior in cluster mode:

1. CPU defaults were too aggressive for containerized workloads:
   - reserve_cpu_cores: 1 → 0.2 (int → float)
   - worker_cpu_core_estimate: 1.0 → 0.2
   - max(..., 1) usable cores guard → max(..., 0) for fractional cores

2. Cluster leader applied local-host capacity to cluster-wide targets,
   starving per-queue demand (e.g. the fast queue capped to 1 worker
   despite needing 12). Added evaluateDemand() for strategy + config
   bounds only, with no capacity constraint; distributeClusterTarget()
   now respects per-host maxWorkers.

3. clusterScaleSignal() recommended scale_down at 100% utilization
   because it only checked bin-packing. Now blocks scale-down when
   utilization ≥ 80% or any workload has pending jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Finding 1: clusterScaleSignal's action check relied on the raw spaceship-
operator int, which happened to work at this call site but was fragile.
Replaced with a direct target_workers > current_workers comparison, which
works regardless of whether the action is an int or a string.
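
Roughly, with hypothetical variable names:

```php
// Before; fragile: depends on <=> returning exactly 1 and the action staying an int.
$scalingUp = ($targetWorkers <=> $currentWorkers) === 1;

// After: direct comparison, independent of the action representation.
$scalingUp = $targetWorkers > $currentWorkers;
```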

Finding 3: Added 16 unit tests for new code paths:
- evaluateDemand: config bounds respected, capacity ignored
- clusterScaleSignal: hold at ≥80% util, hold with pending, boundary at 79%
- distributeClusterTarget: per-host maxWorkers cap, skip full hosts, preserve existing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two CapacityCalculator tests compared maxWorkersByCpu across separate
CPU measurements. On small CI runners the second measurement could
spike, yielding 0 available cores and failing the comparison. Fixed
by reusing the cached metrics snapshot — only the estimate changes
between calls, making the comparison deterministic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set max_cpu_percent=100 in the two comparison tests so that even
saturated CI runners (>85% CPU) have non-zero available headroom.
Without this, availableCpuPercent=0 makes both estimates yield 0
workers and the comparison is meaningless.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI runners in cgroup containers without explicit CPU limits report 0
cores via system-metrics. With reserve_cpu_cores > 0, usableCores
becomes 0 regardless of config, making both estimate comparisons
yield 0 workers.

Fix: set reserve_cpu_cores=0 in comparison tests and guard the
worker-count comparison behind a capacity check. When system-metrics
reports 0 cores (CI), verify the estimate value is correctly applied
instead of comparing worker counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
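
The failure mode in isolation (plain PHP, with literals standing in for config values):

```php
<?php

// Why the guard is needed: with 0 reported cores, any positive reserve
// zeroes out capacity no matter which worker estimate is under test.
$cores = 0.0;                          // cgroup container, no explicit CPU limit
$usable = max($cores - 0.2, 0.0);      // 0.0 whenever reserve_cpu_cores > 0
var_dump((int) floor($usable / 0.2));  // int(0)
var_dump((int) floor($usable / 1.0));  // int(0): the comparison tells us nothing
```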