Conversation

@codesome (Member) commented Jul 4, 2025

@codesome force-pushed the codesome/stale-series-compaction branch from ebbfe83 to 11dd563 on July 8, 2025 at 19:21

@machine424 (Member) left a comment

Thanks for this.
Some questions/suggestions.
I think we can start by tracking those stale series via a metric #55 (comment).

For the rest of the changes, if it's easy to put together, a PoC would be really helpful for seeing things more clearly and for starting to gather meaningful measurements.


**Part 1**

At a regular interval (say 15 minutes), we check whether the stale series have crossed p% of the total series. If they have, we trigger a compaction that simply flushes these stale series into a block and removes them from the Head block (this can produce more than one block if the series cross a block boundary). We skip WAL truncation and m-map file truncation at this stage and let the usual compaction cycle handle them. How we drop these compacted series during WAL replay is TBD during implementation (it may need a new WAL record or use tombstone records).
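
A minimal sketch of that check loop, assuming hypothetical names for the interval, the threshold, and the Head hooks (none of these are the actual PR implementation):

```go
package staleseries

import "time"

// Head hooks assumed for illustration only; the real Head API differs.
type head interface {
	NumSeries() uint64
	NumStaleSeries() uint64
	CompactStaleSeries() error // flush stale series to block(s) and drop them from the Head
}

func runStaleSeriesLoop(h head, stop <-chan struct{}) {
	const (
		checkInterval  = 15 * time.Minute // "regular interval" from the proposal
		staleThreshold = 0.2              // p% expressed as a fraction
	)
	ticker := time.NewTicker(checkInterval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			total := h.NumSeries()
			if total == 0 {
				continue
			}
			if float64(h.NumStaleSeries())/float64(total) >= staleThreshold {
				// WAL and m-map file truncation are intentionally left to the
				// regular compaction cycle, as described above.
				_ = h.CompactStaleSeries()
			}
		}
	}
}
```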

Would the blocks be overlapping and merged during a normal compaction? We'd also need to take the merging overhead into account.

@codesome (Member Author)

That depends on the config. By default, Prometheus does merge overlapping blocks, but that can be disabled as well. The biggest overhead is on instant queries over the current time range: whenever a stale-series compaction happens, they go from reading only from memory to reading from memory plus blocks on disk. Results are in the PoC here.
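
For reference, a sketch of disabling that merging when opening a TSDB. `EnableOverlappingCompaction` is the option name in recent Prometheus versions (older releases used `AllowOverlappingBlocks`); treat the exact field name as an assumption rather than a precise API reference:

```go
package staleseries

import "github.com/prometheus/prometheus/tsdb"

// tsdbOptionsWithoutVerticalCompaction returns TSDB options with overlapping
// (vertical) compaction turned off, so blocks produced by a stale-series flush
// would not be merged back together during regular compaction.
func tsdbOptionsWithoutVerticalCompaction() *tsdb.Options {
	opts := tsdb.DefaultOptions()
	opts.EnableOverlappingCompaction = false // field name assumed; varies by version
	return opts
}
```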

@codesome (Member Author)

Just noticed the feedback @machine424, thanks! I will respond to it soon.

In the meantime, I did a PoC on this and here are the results: prometheus/prometheus#16929 (comment)

I am adding stale series metrics in prometheus/prometheus#16925, which I will finish soon.

@codesome (Member Author) commented Aug 6, 2025

The stale series tracking part is ready for review at prometheus/prometheus#16925.

It is fairly straightforward and should not be blocked on any design work (it only considers stale samples for now).
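
For context, a minimal client_golang sketch of what such a stale-series gauge could look like; the metric name and the update hooks are placeholders, not necessarily what prometheus/prometheus#16925 implements:

```go
package staleseries

import "github.com/prometheus/client_golang/prometheus"

// Placeholder name; the actual metric in the PR may differ.
var headStaleSeries = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "prometheus_tsdb_head_stale_series",
	Help: "Number of Head series whose latest sample is a staleness marker.",
})

func init() {
	prometheus.MustRegister(headStaleSeries)
}

// markSeriesStale would be called when a staleness marker is appended to a series.
func markSeriesStale() { headStaleSeries.Inc() }

// markSeriesActive would be called when a previously stale series receives a real sample.
func markSeriesActive() { headStaleSeries.Dec() }
```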

@jhalterman commented Aug 12, 2025

@codesome Having used the similar early head compaction in Mimir, this is nice to see.

Even with early compaction, though, we still have a period of time when the old and new series are both in memory, which can lead to large spikes in resource usage, even if they're temporary. For the use case you described, where a rollout happens and some new series are sent that directly replace some old series, it would be great if Prometheus could be made to understand which new series replace which old ones, so that fewer resources would be needed internally to track them both (in theory there should be no overlap in samples between the two series). This could take the shape of a separate API that allows Prometheus to be made aware of some relabeling before the new series are pushed. Is this something you've thought about?

@SuperQ (Member) commented Aug 12, 2025

Prometheus already handles directly-replaced series by matching the labels and computing the same internal series ID. It can simply mark the series as not stale.
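
A small illustration of that label-based identity using the model/labels package; the comparison shown here is illustrative, not the Head's actual lookup code:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

func main() {
	// Identical label sets hash identically, so a new sample is appended to the
	// existing Head series and that series stops being considered stale.
	a := labels.FromStrings("__name__", "foo", "pod", "bar1")
	b := labels.FromStrings("__name__", "foo", "pod", "bar1")
	fmt.Println(a.Hash() == b.Hash()) // true: same series

	// A changed label value is a different series entirely.
	c := labels.FromStrings("__name__", "foo", "pod", "bar2")
	fmt.Println(a.Hash() == c.Hash()) // false: distinct series
}
```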

@jhalterman commented Aug 12, 2025

@SuperQ I was thinking of something slightly different, based on the scenario described in this proposal. For example, after a rollout, some new series could be created, e.g. foo{pod="bar2"}, which effectively replaces foo{pod="bar1"}. At present, even with early compaction, we'd have two series in memory for some time. But if we could communicate through an API that something churned, and that one series replaces another, perhaps there could be some savings.

I suspect this is a hard problem since replacements may not always be 1:1, but given the resource spikes that can happen when large numbers of series churn, I thought it was worth mentioning.

@SuperQ (Member) commented Aug 12, 2025

Unfortunately, what you are proposing won't work.

Those are different series. New instances of processes need to be kept separate, otherwise you can end up with signals being attributed to an instance they did not come from.

I get what you're trying to do, but it's not workable in reality.

This proposal solves the "GC" problem that occurs when large numbers of metrics churn.

There are also other proposals we are working on that will further improve things without the need for magic.

@codesome (Member Author)

I have got the code for this into a ready state now in prometheus/prometheus#16929, and I will test it with our prod traffic.

@codesome (Member Author)

@machine424 @SuperQ I have updated the proposal based on the feedback and also added a solution for WAL replay (which I have implemented in prometheus/prometheus#16929).

@bwplotka (Member) commented Sep 9, 2025

Do you mind a quick rebase? We just fixed CI.

@codesome (Member Author)

@machine424 @SuperQ do you have any more comments on this? cc @bboreham
