Conversation

@pkazmierczak (Contributor) commented Oct 15, 2025

Canaries for system jobs are placed on tg.update.canary percent of eligible nodes. Some of these nodes may not be feasible, and until now we removed infeasible nodes during placement computation. However, if the first eligible node we pick for a canary happens to be infeasible, the scheduler halts the deployment.

The solution presented here simplifies canary deployments: initially, system jobs that use canary updates get allocations placed on all eligible nodes, but before we compute the actual placements, a method called evictUnneededCanaries is called (much like evictAndPlace is for honoring MaxParallel) to remove the canary placements that are not needed. We also change the behavior of computePlacements, which no longer performs node feasibility checks, as these are now performed earlier for every allocation and node. This way we get accurate counts of all feasible nodes, which lets us correctly set the deployment state fields.

Fixes: #26885
Fixes: #26886
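To illustrate the eviction step described above, here is a minimal, self-contained Go sketch. The function name mirrors evictUnneededCanaries, but the placement type, the rounding rule, and the keep-the-first-N policy are assumptions made for this example; the actual method operates on the node reconciler's results and the deployment state.

```go
package main

import (
	"fmt"
	"math"
)

// placement is a stand-in for a computed canary placement; the real
// scheduler works with reconciler results and node structs.
type placement struct {
	NodeID string
}

// evictUnneededCanaries keeps only enough canary placements to satisfy the
// configured canary percentage and drops the rest.
func evictUnneededCanaries(placements []placement, canaryPercent int) []placement {
	if canaryPercent <= 0 || len(placements) == 0 {
		return nil
	}

	// Round up so a non-zero percentage always yields at least one canary.
	want := int(math.Ceil(float64(len(placements)) * float64(canaryPercent) / 100.0))
	if want > len(placements) {
		want = len(placements)
	}

	// Placements beyond the desired count are evicted.
	return placements[:want]
}

func main() {
	all := []placement{{"node-a"}, {"node-b"}, {"node-c"}, {"node-d"}}
	kept := evictUnneededCanaries(all, 30) // 30% of 4 nodes, rounded up to 2
	fmt.Println(kept)                      // [{node-a} {node-b}]
}
```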

@pkazmierczak force-pushed the f-system-deployments-canaries-evict-refactor branch from 838bcd8 to e1234c1 on October 28, 2025 08:30
@pkazmierczak force-pushed the f-system-deployments-canaries-evict-refactor branch from e1234c1 to a6cd581 on October 28, 2025 17:33
tgross added a commit that referenced this pull request Oct 28, 2025
Two groups on the same job cannot both have a static port assignment, but this
ends up getting configured in the update block test for system deployments. This
test setup bug has complicated landing the fix in #26953.
@pkazmierczak force-pushed the f-system-deployments-canaries-evict-refactor branch from 697d0ff to 0c70ac8 on October 28, 2025 18:29
@pkazmierczak requested a review from jrasell on October 31, 2025 08:11
… state

Previously we copied the behavior found in the generic scheduler, where
we rely on reconciler results to decide whether enough placements have
been made. In the system scheduler we always know exactly how many
placements there should be, based on the DesiredTotal field of the
deployment state, so a better way to check completeness of the
deployment is to simplify the check and base it on dstate alone.
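As a rough illustration of checking completeness from deployment state alone, here is a sketch in Go. The field names are modeled on Nomad's per-task-group DeploymentState (DesiredTotal, PlacedAllocs, HealthyAllocs), but the surrounding types and the exact condition are simplified assumptions, not the scheduler's real logic.

```go
// deploymentState is a trimmed-down stand-in for Nomad's per-task-group
// DeploymentState; only the fields needed for this sketch are included.
type deploymentState struct {
	DesiredTotal  int
	PlacedAllocs  int
	HealthyAllocs int
}

// deploymentComplete reports whether every task group has placed and made
// healthy at least DesiredTotal allocations, deciding completeness from
// dstate alone rather than from reconciler results.
func deploymentComplete(taskGroups map[string]*deploymentState) bool {
	for _, dstate := range taskGroups {
		if dstate.PlacedAllocs < dstate.DesiredTotal ||
			dstate.HealthyAllocs < dstate.DesiredTotal {
			return false
		}
	}
	return true
}
```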
…iler

In contrast to the cluster reconciler, the node reconciler works
node-by-node rather than alloc-by-alloc, so the state of the reconciler
has to be managed differently. If we override old deployments on every
run of `cancelUnneededDeployments`, we end up with unnecessarily created
deployments for job versions that already had them.
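A simplified sketch of the keep-vs-cancel decision described above, using stub job and deployment types and a hypothetical helper name (splitDeployments); the real node reconciler tracks more state and also has to avoid creating a new deployment for a job version that already has one.

```go
// job and deployment are minimal stubs for this sketch.
type job struct{ Version uint64 }

type deployment struct {
	ID         string
	JobVersion uint64
	Active     bool
}

// splitDeployments returns the active deployment that matches the current
// job version (to be kept, not overridden) and the active deployments for
// older versions (to be cancelled).
func splitDeployments(current *job, deployments []*deployment) (keep *deployment, cancel []*deployment) {
	for _, d := range deployments {
		if !d.Active {
			continue
		}
		if d.JobVersion == current.Version {
			// A deployment for this job version already exists; reuse it
			// instead of creating a new one on every reconciler run.
			keep = d
			continue
		}
		cancel = append(cancel, d)
	}
	return keep, cancel
}
```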
@tgross (Member) left a comment

LGTM. Let's get this merged and then we can mop up any remaining minor issues.


 // computePlacements computes placements for allocations
-func (s *SystemScheduler) computePlacements(place []reconciler.AllocTuple, existingByTaskGroup map[string]bool) error {
+func (s *SystemScheduler) computePlacements(
Member

For cleanup later: this no longer matches the behavior of the other schedulers' computePlacements method. We should consider renaming this to something that describes what's happening here as distinct from that.

Contributor Author

good point. would you say placeAllocs is more apt?

Member

I guess I think of "place" as "find a node for this alloc" and we've already done that. But I don't off the top of my head have a better verb.

Comment on lines +544 to +549
// we should have an entry for every node that is looked
// up. if we don't, something must be wrong
s.logger.Error("failed to locate node feasibility information",
"node-id", node.ID, "task_group", tgName)
// provide a stubbed metric to work with
metrics = &structs.AllocMetric{}
Member

The comment confused me a bit with the if option == nil block below:

Suggested change
-// we should have an entry for every node that is looked
-// up. if we don't, something must be wrong
+// we should have an entry for every node that is looked up
+// (potentially with a nil value). if we don't, something must be wrong
 s.logger.Error("failed to locate node feasibility information",
 	"node-id", node.ID, "task_group", tgName)
 // provide a stubbed metric to work with
 metrics = &structs.AllocMetric{}

I checked coverage and there's no test case that hits this code path. Are we sure it's reachable?

type taskGroupNodes []*taskGroupNode

// feasible returns all taskGroupNode that are feasible for placement
func (t taskGroupNodes) feasible() (feasibleNodes []*taskGroupNode) {
Member

The total number of nodes may be large (100s or 1000s). We probably should initialize the capacity here so that we're not having to perform a bunch of small reallocations (ex https://go.dev/play/p/EEMwoeGObjs)
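A sketch of the suggested pre-sizing, using a stub taskGroupNode with an assumed IsFeasible field (the real struct stores richer feasibility information than a single boolean):

```go
type taskGroupNode struct {
	NodeID     string
	IsFeasible bool // stand-in for whatever feasibility data the real struct carries
}

type taskGroupNodes []*taskGroupNode

// feasible returns all nodes that passed feasibility checks. The result slice
// is allocated with capacity len(t) up front so appending never has to grow
// the backing array, even with thousands of nodes.
func (t taskGroupNodes) feasible() []*taskGroupNode {
	feasibleNodes := make([]*taskGroupNode, 0, len(t))
	for _, n := range t {
		if n.IsFeasible {
			feasibleNodes = append(feasibleNodes, n)
		}
	}
	return feasibleNodes
}
```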

return result
}

// cancelUnneededServiceDeployments cancels any deployment that is not needed.
Member

Suggested change
-// cancelUnneededServiceDeployments cancels any deployment that is not needed.
+// cancelUnneededSystemDeployments cancels any deployment that is not needed.

@pkazmierczak merged commit 00e69d9 into main on Nov 4, 2025
40 checks passed
@pkazmierczak deleted the f-system-deployments-canaries-evict-refactor branch on November 4, 2025 14:10
@ivan-kiselev

👏👏

@pkazmierczak (Contributor, Author)

> we can mop up any remaining minor issues

I was about to do some mopping but I saw Chris has some interesting ideas on how to refactor the whole feasibility-check-before-placement. I want to talk to him before we do any changes on main, perhaps it'll be a larger shift.
