Recovering S3 Data Lost due to wrong retention period configuration #10240

jnavarrof · 2024-12-14T14:23:57Z

Hello everyone,

During a recent rollout of our Grafana Mimir setup, we accidentally triggered a process that wiped out a significant portion of our data. By mistake, we updated the default compactor_blocks_retention_period from 1y to 3m, assuming this would set a 3-month retention period. Two major tenants did not have a retention period explicitly defined (intended to be 1 year), so the default configuration applied, and compactors removed all their data. From what we understand, 3m does not seem to represent 3 months, as all the data was deleted.
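For context on the unit mix-up: as far as I know, this setting is parsed as a Prometheus-style duration, where `m` means minutes and there is no month unit, so `3m` would be three minutes rather than three months. A minimal sketch of that unit table follows. It assumes single-unit values only, and `parse_duration_seconds` is a hypothetical helper for illustration, not Mimir's actual parser.

```python
import re

# Unit table for Prometheus-style durations. Note there is no month unit.
_UNIT_SECONDS = {
    "ms": 0.001,
    "s": 1,
    "m": 60,            # minutes, not months
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "y": 365 * 86400,
}

_DURATION_RE = re.compile(r"^(\d+)(ms|s|m|h|d|w|y)$")


def parse_duration_seconds(text):
    """Convert a single-unit duration string such as '3m' or '1y' to seconds."""
    match = _DURATION_RE.match(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]
```

Under this reading, `parse_duration_seconds("3m")` is 180 seconds, while a retention of roughly three months would be written as something like `90d`.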

We noticed the issue 3 hours after the rollout. Unfortunately, in our Mimir OSS 2.11 setup, compactors are configured to hard delete data after 2 hours, so the data had already been effectively removed. We stopped the compactors and fixed the configuration, setting the default retention period back to 1y for all tenants. Additionally, we increased the deletion delay from 2 hours to 240 hours (10 days) while attempting to recover the data.

With versioning enabled on S3, we managed to recover much of the data by removing delete markers for objects belonging to the two affected tenants. We filtered for objects modified around December 12th (the day of deletion) and skipped the deletion-mark.json files that the compactors use as soft-deletion markers. While we recovered a substantial amount of data, there is still a significant gap between September and December 12th (last week), and we’re unsure if further recovery is possible.
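The recovery step described above can be sketched roughly as follows. This is an illustration under assumptions, not the exact script we ran: the function names, the `tenant_prefix` layout (`<tenant>/...`), and the `skip_suffix` default are hypothetical, and `dry_run` defaults to `True` so nothing is changed by accident. It relies on the S3 behaviour that deleting a delete marker by its version ID makes the previous object version current again.

```python
from datetime import datetime, timezone


def select_delete_markers(markers, tenant_prefix, start, end,
                          skip_suffix="deletion-mark.json"):
    """Pick (key, version_id) pairs of delete markers that should be removed.

    `markers` is a list shaped like the "DeleteMarkers" entries returned by
    S3 ListObjectVersions (Key, VersionId, IsLatest, LastModified).
    """
    selected = []
    for marker in markers:
        key = marker["Key"]
        if not key.startswith(tenant_prefix):
            continue  # other tenants are left alone
        if key.endswith(skip_suffix):
            continue  # do not restore the compactor's soft-deletion markers
        if not marker["IsLatest"]:
            continue  # only a current delete marker hides the object
        if not (start <= marker["LastModified"] < end):
            continue  # keep only deletions from the incident window
        selected.append((key, marker["VersionId"]))
    return selected


def restore_objects(bucket, tenant_prefix, start, end, dry_run=True):
    """Remove the selected delete markers; returns the affected keys."""
    import boto3  # imported lazily so the selection logic runs without AWS access

    client = boto3.client("s3")
    paginator = client.get_paginator("list_object_versions")
    restored = []
    for page in paginator.paginate(Bucket=bucket, Prefix=tenant_prefix):
        markers = page.get("DeleteMarkers", [])
        for key, version_id in select_delete_markers(markers, tenant_prefix, start, end):
            if not dry_run:
                # Deleting the marker version un-hides the previous version.
                client.delete_object(Bucket=bucket, Key=key, VersionId=version_id)
            restored.append(key)
    return restored
```

Running it first with `dry_run=True` and reviewing the returned keys is a cheap way to check the time window and prefix before touching the bucket.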

We are continuing to recover blocks from the S3 bucket, hoping the compactors can identify and rebuild the bucket index with the "right blocks." We’d greatly appreciate any guidance, relevant documentation, or suggestions on additional steps we can take to fully recover the data.

Many thanks!

jnavarrof · 2024-12-16T09:40:00Z

A quick update on this:

We successfully recovered data for one of the tenants and can now see a complete list of 24-hour blocks covering the period from December last year to December this year. Using the listblocks command, we verified that there are no gaps in the blocks, and all days are present:

listblocks -backend=s3 -s3.endpoint ENDPOINT -s3.bucket-name BUCKET -user USER -show-stats
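The "no gaps" check can also be automated once the block time ranges are available. The sketch below assumes the ranges have already been extracted as `(min_time, max_time)` pairs in milliseconds; it does not parse the listblocks output, and `find_gaps` is a hypothetical helper name.

```python
def find_gaps(blocks, tolerance_ms=0):
    """Return (gap_start, gap_end) ranges that no block covers.

    `blocks` is an iterable of (min_time, max_time) pairs in milliseconds.
    Overlapping or out-of-order blocks are handled by sorting and tracking
    the furthest point covered so far.
    """
    gaps = []
    covered_until = None
    for min_time, max_time in sorted(blocks):
        if covered_until is not None and min_time > covered_until + tolerance_ms:
            gaps.append((covered_until, min_time))
        if covered_until is None or max_time > covered_until:
            covered_until = max_time
    return gaps
```

For example, with 24-hour blocks for days 0, 1, and 3, the function reports a single gap spanning day 2; with contiguous days it returns an empty list.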

However, despite the blocks being present, we still observe a complete gap in metrics for this tenant during the same period. It seems the compactors are running as expected and performing housekeeping tasks, but the data remains unavailable.

jnavarrof · 2024-12-16T12:27:37Z

Finally, we identified a missing step in the documentation: the results cache needs flushing. Once the results cache was flushed, all data became available.
