
[E2E] Elasticsearch cluster health should not have been red during mutation process #8267

Open
barkbay opened this issue Nov 27, 2024 · 3 comments
Labels
>test Related to unit/integration/e2e tests

Comments

@barkbay
Contributor

barkbay commented Nov 27, 2024

We have already seen this failure in the past (#5040 and #5795), but it had been quiet until yesterday:

2024-11-26 - TestMutationResizeMemoryDown on Kind 1.29

```json
{
    "Time": "2024-11-26T02:10:27.650635375Z",
    "Action": "output",
    "Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/es",
    "Test": "TestMutationResizeMemoryDown/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process",
    "Output": "    steps_mutation.go:159: ContinuousHealthChecks failures count (10) is above the tolerance (0): 10 x [cluster health red]\n"
}
```

2024-11-27 - TestMutationAndReversal on Kind 1.27.17

```json
{
    "Time": "2024-11-27T02:41:29.251348422Z",
    "Action": "output",
    "Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/es",
    "Test": "TestMutationAndReversal/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process",
    "Output": "    steps_mutation.go:159: ContinuousHealthChecks failures count (11) is above the tolerance (0): 11 x [cluster health red]\n"
}
```
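For reference, the assertion that fires here is a tolerance check over a background health watcher that polls the cluster for the whole duration of the mutation. A minimal sketch of that pattern in Go (the type and method names are illustrative assumptions, not the actual code at steps_mutation.go:159):

```go
package mutation

import "testing"

// ContinuousHealthCheck is a hedged sketch of the watcher behind the
// failure message above: it polls cluster health in the background while
// the mutation runs and records every observation that is not green.
type ContinuousHealthCheck struct {
	FailureCount int
	Failures     []string // e.g. "cluster health red"
}

// Verify is the kind of assertion reported at steps_mutation.go:159:
// any failure count above the tolerance (0 for these tests) fails the test.
func (hc *ContinuousHealthCheck) Verify(t *testing.T, tolerance int) {
	if hc.FailureCount > tolerance {
		t.Errorf("ContinuousHealthChecks failures count (%d) is above the tolerance (%d): %v",
			hc.FailureCount, tolerance, hc.Failures)
	}
}
```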

I had a quick look at the TestMutationAndReversal failure and nothing really stands out so far; the cluster was eventually green.

barkbay added the >test (Related to unit/integration/e2e tests) label on Nov 27, 2024
@barkbay
Contributor Author

barkbay commented Dec 2, 2024

Trying to make sense of the events for this failure:

| Timestamp | Source | Message |
|---|---|---|
| 02:29:25.020 | E2E | TestMutationAndReversal/Starting_to_watch_for_correct_node_shutdown_API_usage |
| 02:29:25.948 | E2E | TestMutationAndReversal/Applying_the_Elasticsearch_mutation_should_succeed |
| 02:29:26.387 | ECK | Requesting shutdown node: test-reverted-mutation-25tq-es-masterdata-2 (note that masterdata-2 is the elected master at this point) |
| 02:29:26.401 | ES | test-reverted-mutation-25tq-es-masterdata-2 - creating shutdown record {nodeId=[M1SoknqJRn-azlIWOql3jg], type=[RESTART], reason=[123067]} |
| 02:29:26.462 | ES | test-reverted-mutation-25tq-es-masterdata-2 - marking node as not ready because it's shutting down |
| 02:29:26.464 | ES | test-reverted-mutation-25tq-es-masterdata-2 - took [1ms] to notify listeners on successful publication of cluster state (version: 80, uuid: NlqSkGGzQO-lLXOCWgtVIg) for [put-node-shutdown-M1SoknqJRn-azlIWOql3jg[PutShutdownNodeTask[request=org.elasticsearch.xpack.shutdown.PutShutdownN |
| 02:29:26.489 | ECK | Node shutdown complete, can start node deletion |
| 02:29:26.999 | ES | test-reverted-mutation-25tq-es-masterdata-1 - transport connection to [{test-reverted-mutation-25tq-es-masterdata-2}.... closed by remote |
| | | no elected master from 02:29:26.999 to 02:29:27.208? |
| 02:29:27.208 | ES | test-reverted-mutation-25tq-es-masterdata-1 - master node changed {previous [], current [{test-reverted-mutation-25tq-es-masterdata-0}{5_A7oYdMRaOAkef3P0rXhA}{2l0Fm1KqQcmKD2PmczNuHQ}{test-reverted-mutation-25tq-es-masterdata-0}{10.244.2.68}{10.244.2.68:9300}{cdfhilmrstw}{8.16.0}{ |
| 02:29:27.327 | ES | test-reverted-mutation-25tq-es-masterdata-0 - Cluster health status changed from [YELLOW] to [RED] (reason: [{test-reverted-mutation-25tq-es-masterdata-2}{M1SoknqJRn-azlIWOql3jg}{UN68ohjnSiSbhA_9LcjJyQ}{test-reverted-mutation-25tq-es-masterdata-2}{10.244.3.48}{10.244.3.48:9300}{cdfhilmrstw}{8.16.0}{7000099-8518000} reason: disconnected]) |
| 02:29:57.817 | ES | test-reverted-mutation-25tq-es-masterdata-2 started {test-reverted-mutation-25tq-es-masterdata-2} |
| 02:29:58.521 | ES | test-reverted-mutation-25tq-es-masterdata-0 - Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ds-ilm-history-7-2024.12.02-000001][0]]]) |
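The "creating shutdown record" entry in the table maps to the standard Elasticsearch node shutdown API (PUT _nodes/&lt;node-id&gt;/shutdown). A hedged reconstruction of the equivalent request in Go, with the node ID and reason taken from the log line above (the URL, credentials, and error handling are placeholders, not the operator's actual client code):

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Body of the standard node shutdown API; type RESTART and the
	// reason match the "creating shutdown record" log line above.
	body := `{"type": "restart", "reason": "123067"}`
	req, err := http.NewRequest(http.MethodPut,
		"https://localhost:9200/_nodes/M1SoknqJRn-azlIWOql3jg/shutdown",
		strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```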

@barkbay
Contributor Author

barkbay commented Dec 2, 2024

From the logs above I'm a bit surprised that there was no election while masterdata-2 was still around. It seems the other nodes started a new election only once masterdata-2 was terminated and the transport connection was lost.
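If this happens again, one way to pin down the election window would be to poll the elected master while the node restarts; a throwaway sketch (the _cat/master endpoint is standard Elasticsearch, the URL and interval are illustrative):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Poll _cat/master: it returns the elected master node, or an
	// error/empty response while no master is elected.
	for {
		ts := time.Now().Format("15:04:05.000")
		resp, err := http.Get("http://localhost:9200/_cat/master?h=node")
		if err != nil {
			fmt.Println(ts, "no response:", err)
		} else {
			out, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			fmt.Print(ts, " ", string(out))
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```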

@pebrc
Collaborator

pebrc commented Dec 2, 2024

I removed my previous post: I had cut off the routing table in a bad place, and there was still a shard assigned.
