Upgrade masters last when upgrading ES clusters #8871
Conversation
Signed-off-by: Michael Montgomery <[email protected]>
✅ Snyk checks have passed. No issues have been found so far.
`buildkite test this -f p=kind,t=TestNonMasterFirstUpgradeComplexTopology -m s=9.1.2`
`buildkite test this -f p=kind,t=TestHandleUpscaleAndSpecChanges_VersionUpgradeDataFirstFlow -m s=9.1.2`
Pull Request Overview
This PR implements a non-master-first upgrade strategy for Elasticsearch clusters. The key change ensures that during version upgrades, non-master nodes (data, ingest, coordinating nodes) are upgraded before master nodes, which helps maintain cluster stability.
- Adds logic to separate master and non-master StatefulSets during version upgrades
- Implements upgrade order validation to ensure non-master nodes complete their upgrades first
- Adds comprehensive unit and e2e tests to verify the upgrade flow
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/controller/elasticsearch/driver/upgrade.go | Adds check to identify new clusters vs upgrades by checking if status version is empty |
| pkg/controller/elasticsearch/driver/upscale.go | Implements non-master-first upgrade logic with resource separation and upgrade status checking |
| pkg/controller/elasticsearch/driver/upscale_test.go | Adds comprehensive unit test for version upgrade flow and minor formatting fixes |
| test/e2e/es/non_master_first_upgrade_test.go | Adds e2e test that validates non-master-first upgrade behavior with a watcher |
I will take another look at it today. Edit: I was not able to get to it today; I will do it tomorrow first thing.
```go
// The only adjustment we want to make to master statefulSets before ensuring that all non-master
// statefulSets have been reconciled is to potentially scale up the replicas
// which should happen 1 at a time as we adjust the replicas early.
if err = maybeUpscaleMasterResources(ctx, masterResources); err != nil {
```
I just realized that calling this when `len(nonMasterResources) == 0` (or more generally, when all non-master nodeSets have already been upgraded?) can be slightly suboptimal.

Assuming that the initial state is:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 9.1.0
  nodeSets:
  - name: default
    config:
      node.roles: ["master", "data", "ingest", "ml"]
      node.store.allow_mmap: false
    count: 3
```

If we update and upgrade to:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 9.1.2
  nodeSets:
  - name: default
    config:
      node.roles: ["master", "data", "ingest", "ml"]
      node.store.allow_mmap: false
    count: 4
```

Then we are going to scale up the 9.1.0 StatefulSet, leading to the creation of `elasticsearch-sample-es-default-3`, but immediately in the next reconciliation we are going to delete `elasticsearch-sample-es-default-3` to upgrade it to 9.1.2.
My previous comment made me wonder if `!isVersionUpgrade` is actually the only reason we might want to reconcile everything at once.
`buildkite test this -f p=kind,t=TestNonMasterFirstUpgradeComplexTopology -m s=8.15.2`
```go
func(k *test.K8sClient, t *testing.T) {
	statefulSets, err := essset.RetrieveActualStatefulSets(k.Client, k8s.ExtractNamespacedName(&es))
	if err != nil {
		t.Logf("failed to get StatefulSets: %s", err.Error())
```
Will this test fail if we consistently get an error here? (My feeling is that it's not going to be the case, because `violations` is always empty in that case, but maybe I'm missing something.)
No, it's not. I wonder what limit we would set on errors in the watcher before it fails? There are 4 instances I can find where we currently don't fail in these watchers in the e2e tests:

```go
t.Logf("got error: %v", err)
t.Logf("got error: %v", err)
t.Logf("failed to list pods: %v", err)
t.Logf("got error listing pods: %v", err)
```
barkbay left a comment:
Almost LGTM. I think we need to adjust the way we scale the master nodes; also, the e2e test seems broken (we create the data integrity index with no replicas, which should fail during a rolling upgrade) and may not be accurate in case of errors.
Update upscaleResults with results of upscale.
`buildkite test this -f p=kind,t=TestNonMasterFirstUpgradeComplexTopology -m s=8.15.2`
Fixes #8429
What is changing?
This ensures that the master StatefulSets are always upgraded last when performing a version upgrade of Elasticsearch.
Validation