Recovering S3 Data Lost due to wrong retention period configuration #10240
Replies: 2 comments
-
A quick update on this: We successfully recovered data for one of the tenants and can now see a complete list of 24-hour blocks covering the period from December last year to December this year. Using the listblocks command, we verified that there are no gaps in the blocks, and all days are present:
However, despite the blocks being present, we still observe a complete gap in metrics for this tenant during the same period. It seems the compactors are running as expected and performing housekeeping tasks, but the data remains unavailable.
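The gap check described above can be sketched in a few lines. This is only an illustration under assumptions: the `(min_time, max_time)` tuples are hypothetical inputs that would have to be extracted from the block listing by hand or by script, and `find_gaps` is a made-up helper, not part of any Mimir tool.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(blocks, start, end):
    """Return the (gap_start, gap_end) intervals in [start, end) that no block covers.

    `blocks` is an iterable of (min_time, max_time) datetimes, treated as
    half-open intervals. Overlapping blocks are fine.
    """
    gaps = []
    cursor = start  # everything before `cursor` is known to be covered
    for lo, hi in sorted(blocks):
        if lo >= end:
            break
        if lo > cursor:
            gaps.append((cursor, lo))
        cursor = max(cursor, hi)
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


# Example: three days requested, the middle 24-hour block is missing.
day0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
one_day = timedelta(days=1)
example_blocks = [(day0, day0 + one_day), (day0 + 2 * one_day, day0 + 3 * one_day)]
for gap_start, gap_end in find_gaps(example_blocks, day0, day0 + 3 * one_day):
    print("missing:", gap_start.isoformat(), "->", gap_end.isoformat())
```

An empty result only says the block time ranges are contiguous. It says nothing about whether queriers or caches actually serve the data, which is the situation described here.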
-
Finally, we identified a step missing from the document: the results cache needs flushing. Once the results cache was flushed, all data became available.
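For anyone looking for a concrete way to do the flush, here is a minimal sketch. It assumes the results cache is backed by memcached (the thread does not say which backend was used) and that the cache can be reached directly over TCP. The host name in the usage comment is hypothetical. `flush_all` is the standard memcached text-protocol command, and it drops every entry on that server.

```python
import socket


def flush_memcached(host, port=11211, timeout=5.0):
    """Send memcached's text-protocol `flush_all` and return True if it answers OK."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"flush_all\r\n")
        reply = sock.recv(64)
    return reply.strip() == b"OK"


# Hypothetical usage, one call per cache replica:
# flush_memcached("results-cache-0.example.internal")
```

Restarting the cache pods has the same effect when direct access is not available.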
-
Hello everyone,
During a recent rollout of our Grafana Mimir setup, we accidentally triggered a process that wiped out a significant portion of our data. By mistake, we updated the default `compactor_blocks_retention_period` from `1y` to `3m`, assuming this would set a 3-month retention period. Two major tenants did not have a retention period explicitly defined (intended to be 1 year), so the default configuration applied, and compactors removed all their data. From what we understand, `3m` does not seem to represent 3 months, as all the data was deleted.
We noticed the issue 3 hours after the rollout. Unfortunately, in Mimir OSS 2.11, compactors are configured to hard delete data after 2 hours, so the data was effectively removed. We stopped the compactors and fixed the configuration, setting the default retention period back to `1y` for all tenants. Additionally, we increased the deletion delay time from 2 hours to 240 hours (10 days) while attempting to recover the data.
With versioning enabled on S3, we managed to recover much of the data by removing delete markers for objects belonging to the two affected tenants. We filtered based on objects modified around December 12th (the day of deletion) and skipped the `deletion-mark.json` files used as soft deletion markers by the compactors. While we recovered a substantial amount of data, there is still a significant gap between last week, September, and December 12th, and we’re unsure if further recovery is possible.
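The delete-marker removal described above can be sketched roughly as follows. This is not the exact script we ran, only an illustration under assumptions: `client` is expected to be a boto3 S3 client (`boto3.client("s3")`), the bucket name and tenant prefix are placeholders, and the soft-deletion marker file name is passed as a parameter with an assumed default. The function names are made up for this example.

```python
from datetime import date


def select_delete_markers(markers, day, skip_suffix="deletion-mark.json"):
    """Keep current delete markers created on `day`, except compactor marker files."""
    return [
        m for m in markers
        if m.get("IsLatest")
        and m["LastModified"].date() == day
        and not m["Key"].endswith(skip_suffix)
    ]


def remove_delete_markers(client, bucket, tenant_prefix, day, dry_run=True):
    """Delete matching S3 delete markers so the previous version becomes current.

    Returns the list of keys that were (or, in dry-run mode, would be) restored.
    """
    restored = []
    paginator = client.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=tenant_prefix):
        for marker in select_delete_markers(page.get("DeleteMarkers", []), day):
            if not dry_run:
                # Deleting a delete marker by VersionId un-deletes the object.
                client.delete_object(
                    Bucket=bucket, Key=marker["Key"], VersionId=marker["VersionId"]
                )
            restored.append(marker["Key"])
    return restored
```

Running it with `dry_run=True` first and reviewing the returned keys is a cheap safeguard before touching the bucket.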
We are continuing to recover blocks from the S3 bucket, hoping the compactors can identify and rebuild the bucket index with the "right blocks." We’d greatly appreciate any guidance, relevant documentation, or suggestions on additional steps we can take to fully recover the data.
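On the `3m` value: assuming the setting uses Prometheus-style duration units, where `m` means minutes and there is no month unit, `3m` would be read as three minutes, which would explain why everything older than that was deleted. The parser below is a simplified stand-in written for illustration, not Mimir's own code, and it does not enforce the unit ordering a real parser may require.

```python
import re
from datetime import timedelta

# Unit table assumed to follow Prometheus-style durations (a year is 365 days).
_UNITS = {
    "ms": timedelta(milliseconds=1),
    "s": timedelta(seconds=1),
    "m": timedelta(minutes=1),
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
    "y": timedelta(days=365),
}
_TOKEN = re.compile(r"(\d+)(ms|y|w|d|h|m|s)")


def parse_duration(text):
    """Parse strings such as '1y', '90d' or '1h30m' into a timedelta."""
    total = timedelta()
    pos = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if not text or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


for value in ("1y", "3m", "90d", "13w"):
    print(value, "=", parse_duration(value))
```

Under that assumption, a retention of roughly three months would be written as something like `90d` or `13w`.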
Many thanks!