WAL corruption on 2/9 ingesters after upgrading from GEM v1.7.0 -> v2.0.1
Jun 15, 2022
What did you do?
The customer upgraded 9 GEM ingesters from v1.7.0 to v2.0.1. The corresponding Prometheus versions are:
v1.7.0: github.com/grafana/prometheus-private v0.0.0-20211105104652-a882d28d367e
v2.0.1: github.com/grafana/mimir-prometheus v0.0.0-20220210151959-f8e3195f7500
(Note that prometheus-private has been renamed to mimir-prometheus, so that's the same repo)
What did you expect to see?
We expected the new version to replay the WAL successfully.
What did you see instead? Under which circumstances?
After updating the 9 ingesters, 2 of them encountered a WAL corruption during startup. I checked the diff between the two Prometheus versions in use, and the only modifications to the WAL that I can see are in this PR, but I don't know how this change would lead to a corruption during replay.
The customer also reported that the WAL replay took much longer than usual on the other 7 ingesters, which came back successfully. I'm not sure whether that's relevant to this issue.
I only have screenshots of the logs.
This is the WAL corruption in the ingester log:
This is from the log of one of the ingesters which came up successfully. Note that the reported replay duration is 2h22min; the customer said that restarting an ingester usually took 5-10min, not 2h22min:
Side-note: This could also simply indicate an issue with the used storage volumes, but I still wanted to submit this issue to check if somebody might be aware of a change that could lead to this.