Conversation

@codesome (Member) commented Jul 4, 2025

@codesome force-pushed the codesome/stale-series-compaction branch from ebbfe83 to 11dd563 on July 8, 2025 at 19:21

@machine424 (Member) left a comment

Thanks for this.
Some questions/suggestions.
I think we can start by tracking those stale series via a metric #55 (comment).

For the rest of the changes, if it's easy to put together, a PoC would be really helpful for seeing things more clearly and for starting to gather meaningful measurements.


**Part 1**

At a regular interval (say 15 minutes), we check whether the stale series have crossed p% of the total series. If they have, we trigger a compaction that simply flushes these stale series into a block and removes them from the Head block (this can produce more than one block if the series cross a block boundary). We skip WAL truncation and m-map file truncation at this stage and let the usual compaction cycle handle them. How we drop these compacted series during WAL replay is TBD during implementation (it may need a new WAL record or use tombstone records).
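
A minimal sketch of that check loop, assuming hypothetical names for the interval, the threshold, and the Head hooks (none of these are the actual PR implementation):

```go
package staleseries

import "time"

// Head hooks assumed for illustration only; the real Head API differs.
type head interface {
	NumSeries() uint64
	NumStaleSeries() uint64
	CompactStaleSeries() error // flush stale series to block(s) and drop them from the Head
}

func runStaleSeriesLoop(h head, stop <-chan struct{}) {
	const (
		checkInterval  = 15 * time.Minute // "regular interval" from the proposal
		staleThreshold = 0.2              // p% expressed as a fraction
	)
	ticker := time.NewTicker(checkInterval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			total := h.NumSeries()
			if total == 0 {
				continue
			}
			if float64(h.NumStaleSeries())/float64(total) >= staleThreshold {
				// WAL and m-map file truncation are intentionally left to the
				// regular compaction cycle, as described above.
				_ = h.CompactStaleSeries()
			}
		}
	}
}
```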

Would the blocks be overlapping and merged during a normal compaction? We'd also need to take the merging overhead into account.

@codesome (Member Author)

That depends on the config. By default, Prometheus does merge overlapping blocks, but that can be disabled as well. The biggest overhead is on instant queries over the current time range: whenever a stale-series compaction happens, they go from reading only from memory to reading from memory plus blocks on disk. Results are in the PoC here.
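
For reference, a sketch of disabling that merging when opening a TSDB. `EnableOverlappingCompaction` is the option name in recent Prometheus versions (older releases used `AllowOverlappingBlocks`); treat the exact field name as an assumption rather than a precise API reference:

```go
package staleseries

import "github.com/prometheus/prometheus/tsdb"

// tsdbOptionsWithoutVerticalCompaction returns TSDB options with overlapping
// (vertical) compaction turned off, so blocks produced by a stale-series flush
// would not be merged back together during regular compaction.
func tsdbOptionsWithoutVerticalCompaction() *tsdb.Options {
	opts := tsdb.DefaultOptions()
	opts.EnableOverlappingCompaction = false // field name assumed; varies by version
	return opts
}
```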

@codesome (Member Author)

Just noticed the feedback @machine424, thanks! I will respond to it soon.

In the meantime, I did a PoC on this and here are the results: prometheus/prometheus#16929 (comment)

I am adding stale series metrics in prometheus/prometheus#16925, which I will finish soon.

@codesome (Member Author) commented Aug 6, 2025

The stale series tracking part is ready for review at prometheus/prometheus#16925.

It is fairly straightforward and should not be blocked on any design work (it only considers stale samples for now).
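
For context, a minimal client_golang sketch of what such a stale-series gauge could look like; the metric name and the update hooks are placeholders, not necessarily what prometheus/prometheus#16925 implements:

```go
package staleseries

import "github.com/prometheus/client_golang/prometheus"

// Placeholder name; the actual metric in the PR may differ.
var headStaleSeries = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "prometheus_tsdb_head_stale_series",
	Help: "Number of Head series whose latest sample is a staleness marker.",
})

func init() {
	prometheus.MustRegister(headStaleSeries)
}

// markSeriesStale would be called when a staleness marker is appended to a series.
func markSeriesStale() { headStaleSeries.Inc() }

// markSeriesActive would be called when a previously stale series receives a real sample.
func markSeriesActive() { headStaleSeries.Dec() }
```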

@jhalterman commented Aug 12, 2025

@codesome Having used the similar early head compaction in Mimir, this is nice to see.

Even with early compaction, though, we still have a period of time when the old and new series are both in memory, which can lead to large spikes in resource usage, even if they're temporary. For the use case you described, where a rollout happens and some new series are sent that directly replace some old series, it would be great if Prometheus could be made to understand which new series replace which old ones, so that fewer resources would be needed internally to track them both (in theory there should be no overlap in samples between the two series). This could take the shape of a separate API that allows Prometheus to be made aware of some relabeling before the new series are pushed. Is this something you've thought about?

@SuperQ (Member) commented Aug 12, 2025

Prometheus already handles directly-replaced series by matching the labels and computing the same internal series ID. It can simply mark the series as not stale.
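
A small illustration of that label-based identity using the model/labels package; the comparison shown here is illustrative, not the Head's actual lookup code:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

func main() {
	// Identical label sets hash identically, so a new sample is appended to the
	// existing Head series and that series stops being considered stale.
	a := labels.FromStrings("__name__", "foo", "pod", "bar1")
	b := labels.FromStrings("__name__", "foo", "pod", "bar1")
	fmt.Println(a.Hash() == b.Hash()) // true: same series

	// A changed label value is a different series entirely.
	c := labels.FromStrings("__name__", "foo", "pod", "bar2")
	fmt.Println(a.Hash() == c.Hash()) // false: distinct series
}
```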

@jhalterman commented Aug 12, 2025

@SuperQ I was thinking of something slightly different, based on the scenario described in this proposal. For example, after a rollout, some new series could be created, e.g. foo{pod="bar2"}, which effectively replaces foo{pod="bar1"}. At present, even with early compaction, we'd have two series in memory for some time. But if we could communicate through an API that something churned, and that one series replaces another, perhaps there could be some savings.

I suspect this is a hard problem since replacements may not always be 1:1, but given the resource spikes that can happen when large numbers of series churn, I thought it was worth mentioning.

@SuperQ (Member) commented Aug 12, 2025

Unfortunately, what you are proposing won't work.

Those are different series. New instances of processes need to be kept separate, otherwise you can end up with signals being attributed to an instance they did not come from.

I get what you're trying to do, but it's not workable in reality.

This proposal solves the "GC" problem that occurs when large numbers of metrics churn.

There are also other proposals we are working on that will further improve things without the need for magic.

@codesome (Member Author)

I have got the code for this into a ready state now in prometheus/prometheus#16929, and I will test it with our prod traffic.

@codesome (Member Author)

@machine424 @SuperQ I have updated the proposal based on the feedback and also added a solution for WAL replay (which I have implemented in prometheus/prometheus#16929).

@bwplotka (Member) commented Sep 9, 2025

Do you mind a quick rebase? We just fixed CI.

@codesome (Member Author)

@machine424 @SuperQ do you have any more comments on this? cc @bboreham
