fix(scheduler): guard stale worker writes#230
Conversation
3a43110 to
5546795
Compare
Co-authored-by: Amp <amp@ampcode.com> Amp-Thread-ID: https://ampcode.com/threads/T-019e7456-b8ee-77aa-89bc-178cfab1140d
5546795 to
e062688
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces an apply lease token mechanism so only the currently-claimed scheduler worker can perform “owned” writes and external side effects for an apply, preventing stale workers/observers from publishing outdated state after heartbeat-based recovery.
Changes:
- Add lease ownership fields to
applies(owner, rotating token, acquired timestamp) and plumb the lease through scheduler claim and context. - Guard apply/task/log writes (and control-request completion) behind the active lease token; add
CheckLeaseAPI for non-mutating verification. - Guard webhook comment observer progress/terminal side effects by verifying the active lease; add regression tests and documentation updates.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/webhook/handler.go | Passes the active apply lease into the comment observer configuration. |
| pkg/webhook/comment_observer.go | Skips GitHub side effects for progress/terminal events when the lease is no longer current. |
| pkg/webhook/apply_comment_e2e_test.go | Adds E2E coverage ensuring stale observers don’t perform terminal side effects after lease loss. |
| pkg/tern/local_control_resume.go | Adjusts resume execution contexts to avoid cancellation propagation; used by resume paths. |
| pkg/tern/local_client_integration_test.go | Updates tests for new FindNextApply(ctx, owner) signature. |
| pkg/tern/local_apply_sequential.go | Cancels heartbeat loop on lease-loss error (intended to stop stale workers). |
| pkg/tern/grpc_client_test.go | Updates mock ApplyStore for the new CheckLease method. |
| pkg/tern/control_requests.go | Ensures completing/failing control requests is gated by the active apply lease (when present). |
| pkg/storage/types.go | Adds lease fields to storage.Apply and a helper to build ApplyLease. |
| pkg/storage/storage.go | Extends ApplyStore with owner-aware claiming and CheckLease. |
| pkg/storage/mysqlstore/tasks.go | Guards task updates with lease token predicates and returns lease-loss errors on mismatch. |
| pkg/storage/mysqlstore/apply_logs.go | Guards apply log appends with lease token checks and returns lease-loss errors on mismatch. |
| pkg/storage/mysqlstore/applies.go | Implements lease token rotation on claim, lease-guarded updates/heartbeats, and CheckLease. |
| pkg/storage/mysqlstore/applies_test.go | Adds regression coverage for lease-guarded owned writes and updates claim call sites. |
| pkg/storage/lease.go | Introduces context plumbing for ApplyLease. |
| pkg/storage/errors.go | Adds ErrApplyLeaseLost for stale worker ownership failures. |
| pkg/schema/mysql/applies.sql | Adds lease columns to the MySQL applies schema. |
| pkg/api/scheduler.go | Claims applies with a stable owner identity, attaches lease to context, and handles lease-loss errors. |
| pkg/api/handlers_test.go | Updates ApplyStore test double to support new claim signature and CheckLease. |
| integration/scheduler_test.go | Updates integration tests for the new claim signature. |
| docs/architecture.md | Documents the heartbeat vs. lease-token split and stale-worker failure behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "database", apply.Database, | ||
| "environment", apply.Environment, | ||
| "error", err) | ||
| cancel() | ||
| return |
| fmt.Sprintf("Resume requested (sequential): %d tasks to resume, %d already completed", len(resumeTasks), completedCount), oldApplyState, state.Apply.Running) | ||
|
|
||
| resumeCtx, cancelResume := context.WithCancel(context.Background()) | ||
| resumeCtx, cancelResume := context.WithCancel(context.WithoutCancel(ctx)) |
| } | ||
|
|
||
| stopHeartbeat := c.startApplyHeartbeat(context.Background(), apply) | ||
| resumeCtx := context.WithoutCancel(ctx) |
| if hasLease { | ||
| leasePredicate = ` | ||
| AND EXISTS ( | ||
| SELECT 1 FROM applies a | ||
| WHERE a.id = tasks.apply_id AND a.lease_token = ? | ||
| )` |
There was a problem hiding this comment.
WHERE lease_token = ? predicates are on applies, tasks, apply_logs, and Heartbeat, but scheduler-owned writes against apply_operations, apply_comments, checks, and the control-request complete/fail paths are still unconditional at the SQL layer...
perhaps extend (similar to this) same lease-token guard to those so that the check is atomic with the write instead of a separate round-trip.
| } | ||
| ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second) | ||
| defer cancel() | ||
| if err := o.stor.Applies().CheckLease(ctx, lease); err != nil { |
There was a problem hiding this comment.
edit history overwrites - details from my agent -
Apply 42, current lease = TOK_A held by worker A.
t=0ms OnTerminal fires on A
t=1ms leaseStillOwnsObserver: CheckLease(TOK_A) → SELECT returns 1 row ✅
t=2ms returns true
────────────────────────────────────────────────────
window where another worker can rotate the lease
────────────────────────────────────────────────────
t=8ms worker B reclaims → rotates row's lease_token to TOK_B
(A still holds TOK_A in memory; A's view of ownership is now stale)
t=15ms A enters editTrackedComment: client.EditIssueComment(...) hits GitHub
→ the "failed: stale worker error" body gets posted to the PR
t=600ms GitHub responds 200; A then calls IncrementEditCount
→ that storage call is *also* unguarded against the lease, so the
edit-count row gets bumped under the wrong owner
t=2s B's OnProgress tries to edit the same comment with the real
"running" body → PR comment flips back, leaving the operator with
a confusing edit history
Why
Workers can be recovered after an apply heartbeat becomes stale, but an old worker or observer may still be running locally. Control-plane state should only be writable by the claim that currently owns the apply, so stale workers fail closed instead of publishing outdated state.
What
Claim lease flow
Generated with Amp