WAL corruption on 2/9 ingesters after upgrading from GEM v1.7.0 -> v2.0.1 #266

replay · 2022-06-15T16:06:36Z

What did you do?

The customer upgraded 9 GEM ingesters from v1.7.0 to v2.0.1. The corresponding Prometheus versions are:

v1.7.0: github.com/grafana/prometheus-private v0.0.0-20211105104652-a882d28d367e
v2.0.1: github.com/grafana/mimir-prometheus v0.0.0-20220210151959-f8e3195f7500

(Note that prometheus-private has been renamed to mimir-prometheus, so both versions come from the same repo.)

What did you expect to see?

We expected the new version to replay the WAL successfully.

What did you see instead? Under which circumstances?

After updating the 9 ingesters, 2 of them encountered a WAL corruption during startup. I have checked the diff between the two Prometheus versions in use, and the only modifications to the WAL that I can see are in this PR, but I don't know how this change would lead to a corruption during replay.

The customer also reported that the WAL replay took much longer than usual on the other 7 ingesters, which came back successfully. I'm not sure whether that's relevant to this issue.

I only have screenshots of the logs.

This is the WAL corruption in the ingester log:

This is from the log of one of the ingesters that came up successfully. Note that the reported replay duration is 2h22min, whereas the customer said that restarting an ingester usually takes 5-10min:

Side note: This could also simply indicate an issue with the storage volumes in use, but I still wanted to submit this issue to check whether somebody is aware of a change that could lead to this.

replay changed the title ~~On-prem customer reported that after upgrading 9 Ingesters from GEM v1.7.0 -> v2.0.1 2/9 encountered a WAL corruption during startup~~ WAL corruption on 2/9 ingesters after upgrading from GEM v1.7.0 -> v2.0.1 Jun 15, 2022
