Skip to content

fix(scheduler): guard stale worker writes#230

Open
aparajon wants to merge 1 commit into
mainfrom
armand/scheduler-lease-ownership
Open

fix(scheduler): guard stale worker writes#230
aparajon wants to merge 1 commit into
mainfrom
armand/scheduler-lease-ownership

Conversation

@aparajon
Copy link
Copy Markdown
Collaborator

@aparajon aparajon commented Jun 4, 2026

Why

Workers can be recovered after an apply heartbeat becomes stale, but an old worker or observer may still be running locally. Control-plane state should only be writable by the claim that currently owns the apply, so stale workers fail closed instead of publishing outdated state.

What

  • Add a rotating apply lease token when an apply is claimed
  • Guard apply, task, log, heartbeat, control-request, and observer side effects by the active lease
  • Cancel local apply work when the heartbeat loop detects lease loss
  • Document the split between heartbeat recovery and lease-token write ownership
  • Add regression coverage for stale worker writes and stale observer terminal side effects

Claim lease flow

1. worker A claims the apply

   ╭──────────╮        claim         ╭──────────────────────────╮
   │ worker A │─────────────────────▶│ applies row              │
   │ token A  │                      │ updated_at = fresh       │
   ╰──────────╯                      │ lease_token = token A    │
                                     ╰──────────────────────────╯

2. worker A stops heartbeating long enough for recovery

   ╭──────────╮                      ╭──────────────────────────╮
   │ worker A │                      │ applies row              │
   │ token A  │                      │ updated_at = stale       │
   ╰──────────╯                      │ lease_token = token A    │
                                     ╰──────────────────────────╯

3. worker B claims the stale apply and rotates the token

   ╭──────────╮        claim         ╭──────────────────────────╮
   │ worker B │─────────────────────▶│ applies row              │
   │ token B  │                      │ updated_at = fresh       │
   ╰──────────╯                      │ lease_token = token B    │
                                     ╰──────────────────────────╯

4. worker A wakes up and tries to write with token A

   ╭──────────╮   write WHERE token=A ╭──────────────────────────╮
   │ worker A │──────────────────────▶│ applies row              │
   │ token A  │       rejected        │ lease_token = token B    │
   ╰──────────╯                       ╰──────────────────────────╯

Generated with Amp

@aparajon aparajon force-pushed the armand/scheduler-lease-ownership branch from 3a43110 to 5546795 Compare June 4, 2026 04:10
@aparajon aparajon changed the base branch from armand/cutover-recovery-state to main June 4, 2026 04:10
Copilot AI review requested due to automatic review settings June 4, 2026 04:18
@aparajon aparajon force-pushed the armand/scheduler-lease-ownership branch from 5546795 to e062688 Compare June 4, 2026 04:18
@aparajon aparajon marked this pull request as ready for review June 4, 2026 04:22
@aparajon aparajon requested review from Kiran01bm and morgo as code owners June 4, 2026 04:22
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an apply lease token mechanism so only the currently-claimed scheduler worker can perform “owned” writes and external side effects for an apply, preventing stale workers/observers from publishing outdated state after heartbeat-based recovery.

Changes:

  • Add lease ownership fields to applies (owner, rotating token, acquired timestamp) and plumb the lease through scheduler claim and context.
  • Guard apply/task/log writes (and control-request completion) behind the active lease token; add CheckLease API for non-mutating verification.
  • Guard webhook comment observer progress/terminal side effects by verifying the active lease; add regression tests and documentation updates.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/webhook/handler.go Passes the active apply lease into the comment observer configuration.
pkg/webhook/comment_observer.go Skips GitHub side effects for progress/terminal events when the lease is no longer current.
pkg/webhook/apply_comment_e2e_test.go Adds E2E coverage ensuring stale observers don’t perform terminal side effects after lease loss.
pkg/tern/local_control_resume.go Adjusts resume execution contexts to avoid cancellation propagation; used by resume paths.
pkg/tern/local_client_integration_test.go Updates tests for new FindNextApply(ctx, owner) signature.
pkg/tern/local_apply_sequential.go Cancels heartbeat loop on lease-loss error (intended to stop stale workers).
pkg/tern/grpc_client_test.go Updates mock ApplyStore for the new CheckLease method.
pkg/tern/control_requests.go Ensures completing/failing control requests is gated by the active apply lease (when present).
pkg/storage/types.go Adds lease fields to storage.Apply and a helper to build ApplyLease.
pkg/storage/storage.go Extends ApplyStore with owner-aware claiming and CheckLease.
pkg/storage/mysqlstore/tasks.go Guards task updates with lease token predicates and returns lease-loss errors on mismatch.
pkg/storage/mysqlstore/apply_logs.go Guards apply log appends with lease token checks and returns lease-loss errors on mismatch.
pkg/storage/mysqlstore/applies.go Implements lease token rotation on claim, lease-guarded updates/heartbeats, and CheckLease.
pkg/storage/mysqlstore/applies_test.go Adds regression coverage for lease-guarded owned writes and updates claim call sites.
pkg/storage/lease.go Introduces context plumbing for ApplyLease.
pkg/storage/errors.go Adds ErrApplyLeaseLost for stale worker ownership failures.
pkg/schema/mysql/applies.sql Adds lease columns to the MySQL applies schema.
pkg/api/scheduler.go Claims applies with a stable owner identity, attaches lease to context, and handles lease-loss errors.
pkg/api/handlers_test.go Updates ApplyStore test double to support new claim signature and CheckLease.
integration/scheduler_test.go Updates integration tests for the new claim signature.
docs/architecture.md Documents the heartbeat vs. lease-token split and stale-worker failure behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +244 to +248
"database", apply.Database,
"environment", apply.Environment,
"error", err)
cancel()
return
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one seems legit..

fmt.Sprintf("Resume requested (sequential): %d tasks to resume, %d already completed", len(resumeTasks), completedCount), oldApplyState, state.Apply.Running)

resumeCtx, cancelResume := context.WithCancel(context.Background())
resumeCtx, cancelResume := context.WithCancel(context.WithoutCancel(ctx))
}

stopHeartbeat := c.startApplyHeartbeat(context.Background(), apply)
resumeCtx := context.WithoutCancel(ctx)
Comment on lines +110 to +115
if hasLease {
leasePredicate = `
AND EXISTS (
SELECT 1 FROM applies a
WHERE a.id = tasks.apply_id AND a.lease_token = ?
)`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WHERE lease_token = ? predicates are on applies, tasks, apply_logs, and Heartbeat, but scheduler-owned writes against apply_operations, apply_comments, checks, and the control-request complete/fail paths are still unconditional at the SQL layer...

perhaps extend (similar to this) same lease-token guard to those so that the check is atomic with the write instead of a separate round-trip.

}
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
if err := o.stor.Applies().CheckLease(ctx, lease); err != nil {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit history overwrites - details from my agent -

Apply 42, current lease = TOK_A held by worker A.

t=0ms     OnTerminal fires on A
t=1ms     leaseStillOwnsObserver: CheckLease(TOK_A) → SELECT returns 1 row ✅
t=2ms     returns true
            ────────────────────────────────────────────────────
            window where another worker can rotate the lease
            ────────────────────────────────────────────────────
t=8ms     worker B reclaims → rotates row's lease_token to TOK_B
          (A still holds TOK_A in memory; A's view of ownership is now stale)
t=15ms    A enters editTrackedComment: client.EditIssueComment(...) hits GitHub
            → the "failed: stale worker error" body gets posted to the PR
t=600ms   GitHub responds 200; A then calls IncrementEditCount
            → that storage call is *also* unguarded against the lease, so the
              edit-count row gets bumped under the wrong owner
t=2s      B's OnProgress tries to edit the same comment with the real
          "running" body → PR comment flips back, leaving the operator with
          a confusing edit history

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants