Proposal: Early Compaction of Stale Series from the Head Block #55
Conversation
Signed-off-by: Ganesh Vernekar <[email protected]>
Force-pushed from ebbfe83 to 11dd563.
Thanks for this.
Some questions/suggestions.
I think we can start with tracking those stale series via a metric #55 (comment).
For the rest of the changes, if it's easy to put together, having a PoC would be really helpful to see things more clearly and to start gathering meaningful measurements.
**Part 1**

At a regular interval (say every 15 minutes), we check whether the stale series have crossed p% of the total series. If they have, we trigger a compaction that simply flushes these stale series into a block and removes them from the Head block (this can produce more than one block if the series cross a block boundary). We skip WAL truncation and m-map file truncation at this stage and let the usual compaction cycle handle them. How we drop these compacted series during WAL replay is TBD during implementation (it may need a new WAL record or tombstone records).
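To make the mechanism concrete, here is a minimal sketch of the periodic check, assuming hypothetical `NumSeries`, `NumStaleSeries`, and `CompactStaleSeries` methods; the actual PoC in prometheus/prometheus#16929 may structure this differently:

```go
package tsdbsketch

import (
	"context"
	"log"
	"time"
)

// head abstracts the pieces of the TSDB Head this sketch needs. All three
// methods are hypothetical stand-ins, not the actual Prometheus API.
type head interface {
	NumSeries() uint64
	NumStaleSeries() uint64
	CompactStaleSeries(ctx context.Context) error
}

// runStaleCompactLoop checks every interval whether stale series exceed
// pct% of the total series in the Head and, if so, triggers the early
// compaction described above. WAL and m-map file truncation are
// deliberately left to the usual compaction cycle.
func runStaleCompactLoop(ctx context.Context, h head, interval time.Duration, pct float64) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			total := h.NumSeries()
			if total == 0 {
				continue
			}
			if float64(h.NumStaleSeries())/float64(total)*100 < pct {
				continue
			}
			// Flush the stale series into one or more blocks (split at
			// block boundaries) and drop them from the Head.
			if err := h.CompactStaleSeries(ctx); err != nil {
				log.Printf("stale series compaction failed: %v", err)
			}
		}
	}
}
```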
Would the blocks be overlapping and merged during a normal compaction? We'd also need to take the merging overhead into account.
That depends on the config. By default Prometheus merges overlapping blocks, but that can be disabled. The biggest overhead is on instant queries over the current time range: whenever a stale series compaction happens, they go from reading only from memory to reading from memory plus blocks on disk. Results are in the PoC here.
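For reference, whether TSDB merges overlapping blocks is an option on `tsdb.Options`. A hedged sketch follows; the field name `EnableOverlappingCompaction` matches recent Prometheus versions but may differ in older ones:

```go
package tsdbsketch

import "github.com/prometheus/prometheus/tsdb"

// overlappingCompactionOff returns TSDB options with overlapping-block
// merging disabled. Assumption: this field controls that behavior in
// recent Prometheus versions; older versions name it differently.
func overlappingCompactionOff() *tsdb.Options {
	opts := tsdb.DefaultOptions()
	opts.EnableOverlappingCompaction = false
	return opts
}
```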
Just noticed the feedback, @machine424, thanks! I will respond soon. In the meantime, I did a PoC on this and here are the results: prometheus/prometheus#16929 (comment). I am adding stale series metrics in prometheus/prometheus#16925, which I will finish soon.
The stale series tracking part is ready for review at prometheus/prometheus#16925. It is fairly straightforward and should not be blocked on any design work (it considers only stale samples for now).
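As an illustration of what such tracking could look like using client_golang (the metric name below is my assumption, not necessarily what prometheus/prometheus#16925 uses):

```go
package tsdbsketch

import "github.com/prometheus/client_golang/prometheus"

// staleSeries is a hypothetical gauge for the number of Head series whose
// most recent sample is a staleness marker; the real PR may differ.
var staleSeries = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "prometheus_tsdb_head_stale_series",
	Help: "Number of series in the Head whose latest sample is a staleness marker.",
})

func init() {
	// Expose the gauge via the default registry.
	prometheus.MustRegister(staleSeries)
}
```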
@codesome Having used similar early head compaction in Mimir, this is nice to see. Even with early compaction, though, we still have a period of time when the old and new series are both in memory, which can lead to large spikes in resource usage, even if they're temporary. For the use case you described, where a rollout happens and some new series are sent that directly replace some old series, it would be great if Prometheus could be made to understand which new series replace which old series, so that fewer resources would be needed internally to track them both (in theory there should be no overlap in samples between the two series). This could take the shape of a separate API that allows Prometheus to be made aware of some relabeling before the new series are pushed. Is this something you've thought about?
Prometheus already handles direct series replacement by matching the labels and computing the same internal series ID. It can simply mark the series as not stale.
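A small illustration of that point, using the real `model/labels` package: identical label sets hash identically, so the Head resolves a replacement scrape to the existing series rather than creating a new one:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

func main() {
	// The same label set before and after a restart of the same target.
	before := labels.FromStrings("__name__", "http_requests_total", "job", "api")
	after := labels.FromStrings("__name__", "http_requests_total", "job", "api")
	fmt.Println(before.Hash() == after.Hash()) // true: resolved to one series
}
```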
@SuperQ I was thinking of something slightly different, based on the scenario described in this proposal. For example, after a rollout, some new series could be created, ex:

I suspect this is a hard problem since replacements may not always be 1:1, but given the resource spikes that can happen when large numbers of series churn, I thought it was worth mentioning.
Unfortunately, what you are proposing won't work: those are different series. New instances of processes need to be kept separate, otherwise you can end up with signal attribution that should not happen. I get what you're trying to do, but it's not workable in practice. This proposal solves the "GC" problem that occurs when large numbers of metrics churn. There are also other proposals we are working on that will further improve things without the need for magic.
I now have the code for this in a ready state at prometheus/prometheus#16929, and I will test it on our production traffic.
Signed-off-by: Ganesh Vernekar <[email protected]>
@machine424 @SuperQ I have updated the proposal based on the feedback and also added a solution for the WAL replay (which I have implemented in prometheus/prometheus#16929).
Do you mind a quick rebase? We just fixed CI. |
Signed-off-by: Ganesh Vernekar <[email protected]>
@machine424 @SuperQ Do you have any more comments on this? cc @bboreham
For prometheus/prometheus#13616