Make partitions memory group aware #24472
base: dev
Conversation
src/v/config/configuration.cc
Outdated
, topic_partitions_max_memory_allocation_share(
    *this,
    "topic_partitions_max_memory_allocation_share",
    "foo",
placeholder? :D
Yes thanks fixed.
src/v/config/configuration.cc
Outdated
, topic_partitions_memory_allocation_percent(
    *this,
    "topic_partitions_memory_allocation_percent",
    "Percentage of total memory that is being reserved for topic partitions.",
Suggested change:
- "Percentage of total memory that is being reserved for topic partitions.",
+ "Percentage of total memory to reserve for topic partitions.",
src/v/resource_mgmt/memory_groups.h
Outdated
@@ -31,6 +31,12 @@ struct compaction_memory_reservation {
    double max_limit_pct{100.0};
};

struct partitions_memory_reservation {
    size_t max_limit_pct = 0;
Why is this set to 0, then overwritten after construction in memory_groups()? Can't we just set it at construction with a designated initializer?
No particular reason. We can change it.
bool guess_is_memory_group_aware_memory_limit() {
    // Here we are trying to guess whether `topic_memory_per_partition` is
    // memory group aware. Originally the property was more like a guess of how
    // much memory each partition would statically use. It was first set to 10
Pretty sure it was 1 MiB? We raised it, not lowered it IIRC.
Ah now that you say it that rings a bell. I misremembered.
@@ -81,6 +81,92 @@ allocation_constraints partition_allocator::default_constraints() {
    return req;
}

bool guess_is_memory_group_aware_memory_limit() {
    // Here we are trying to guess whether `topic_memory_per_partition` is
    // memory group aware. Originally the property was more like a guess of how
I don't think that's true. It was an empirical measurement to find an X where 8 GB / X gave a number of partitions that worked on 8 GB instances and 4 GB / X gave a number that worked on 4 GB instances, based on testing on those two instance types (which only really happened incidentally because Arm had some 4 GB instance types we wanted to try).
It was never a guess at the static memory usage of a partition: it was entirely driven by "OOM or not" during tests, so it also included dynamic memory use which "should" be accounted for in other places (but that is not always the case).
Just so the history makes some sense (this all makes more sense given the info that it was 1 MB, not 10 MB): the problem which was noted is that on 4 GB hosts you could not reach the partitions_per_shard value before OOMing, thus this memory setting needed to be raised to lower the limit in that case.
changed
// Note that if we get this guess wrong that isn't immediately fatal. This
// only becomes active when creating new topics or modifying existing
// topics. At that point users can increase the memory reservation if
// needed.
love the very comprehensive comment here
// than 10x smaller. We assume it's unlikely someone would have changed the
// value to be that much smaller and hence guess that all the values larger
// than 2 times the new default are old values. Equally in the new world
// it's unlikely anybody would increase the value (at all really).
I guess the most likely people are ourselves, if we find out that 100K or whatever is not enough, i.e., there is some other significant source of memory usage we didn't account for. If we need to go to 300K, say, we can't, because we'd be over this threshold and it would have the opposite of the expected effect. However, in this case we could just tweak the shares for a similar effect? E.g., reduce shares to 33% of their existing value (this is not exactly the same, since in theory it also gives that memory away to other subsystems, but in practice, based on how this works, it seems similar).
I mean we can always change the default up really, as the heuristic is only needed if the value is overridden (at which point we can also reevaluate the heuristic).
Let's say we had to change it to 400K from the current 200K; then I feel like the heuristic as is would still be useful today, as I don't think anyone would have put it below 1MB (and the caveat that nothing immediately breaks still applies).
clusterlog.warn,
"Refusing to create {} new partitions as total partition count "
"{} "
"would exceed memory limit {}",
Please include in the error message more components of the math, e.g., memory_limit and other values, so we can reconstruct what happened without knowing the config values.
(yes, I know this is c/p from elsewhere but I think this is even more important now)
Added a few more. Let me know if you think anything is missing.
// had overridden the default when it was non-memory group aware. Hence, this
// value should NOT be changed and stay the same even if the (above) default is
// changed.
inline constexpr size_t ORIGINAL_MEMORY_GROUP_AWARE_TMPP = 200_KiB;
Well, if we have to raise DEFAULT_TOPIC_MEMORY_PER_PARTITION 2x or more we won't really be able to adhere to this, right?
Yes, I guess (see the other comment).
We could also remove this bit of logic; I am not sure whether it is overdoing it.
@@ -34,7 +34,6 @@ class partition_allocator {
    partition_allocator(
      ss::sharded<members_table>&,
      ss::sharded<features::feature_table>&,
      config::binding<std::optional<size_t>> memory_per_partition,
why was this binding removed? we just hardcode access to the config now?
Yes, the problem with bindings is that you can't access the properties of a config (like is_overriden etc.).
@@ -149,6 +148,12 @@ class partition_allocator {
      const uint64_t new_partitions_replicas_requested,
      const model::topic_namespace& topic) const;

    // sub-routine of the above, checks available memory
    std::error_code check_memory_limits(
      const uint64_t new_partitions_replicas_requested,
Dangling `const` here, or intentional?
removed
@@ -0,0 +1,32 @@
/* |
Let us put an INFO-level log line at startup that exposes key info about the partition memory calculation, etc.? Like: you have X memory and this Y per-partition memory setting, and Z shares and W total shares, and partitions_per_shard=V, so you get this many partitions total.
Sure. Did you put this comment into this file intentionally?
Bit confused as you mention "Z shares and W total shares" but the partition side doesn't use any shares so wondering whether you think we should print only the partition relevant info (maybe in the partition allocator?) or also all the other memory group info (including shares for other subsystems).
@StephanDollberg - I intentionally put it in an (essentially arbitrary) file because comments on the top-level PR seem to get lost? They are treated differently than review comments, don't have threaded conversations, etc. Maybe I could just have a convention of using the first line of the first file, dunno.
Bit confused as you mention "Z shares and W total shares"
Partitions have 10 shares in memory groups that's what I was referring to. So my comment WXYZ thing was really about memory groups: we should print all the "variable info" from memory groups so we know exactly how the memory was split.
We should also have good logging about partition allocations decisions (and failure reasons), but I think we already have that?
Partitions have 10 shares in memory groups that's what I was referring to. So my comment WXYZ thing was really about memory groups: we should print all the "variable info" from memory groups so we know exactly how the memory was split.
Yeah ok I was just confused because the partitions one doesn't use "shares" but a percentage reservation.
In any case I have pushed a commit which prints all the final memory group allocations on startup.
Looks like:
... [shard 0:main] main - Per shard memory group allocations: total memory: 4.000GiB, total memory minus pre-share reservations: 3.475GiB, chunk cache: 627.953MiB, kafka: 1.226GiB, rpc: 837.271MiB, recovery: 418.635MiB, tiered storage: 418.635MiB, data transforms: 0.000bytes, compaction: 128.000MiB, datalake: 0.000bytes, partitions: 409.600MiB
We should also have good logging about partition allocations decisions (and failure reasons), but I think we already have that?
Yes the updated failure warning should have all the details now.
A couple of changes suggested.
That function is getting kinda big. Move the memory checking part out in preparation for some changes to come.
It's not being watched, so no need to use a binding. Also, bindings don't allow using any of the advanced property methods (is_overriden etc.).
Static partition memory use is currently unaccounted for in memory groups. This patch introduces a reserve to account for partition memory usage. We don't use a share like some of the other groups but rather an upfront percentage-based reservation. This is because we don't want the partitions share to shrink once we add more groups.
Previously `topic_memory_per_partition` was more a rough guess whose value was way too large. This was partly a result of not having a better idea but also because obviously not all memory can be used by partitions. After some analysis via metrics and the memory profiler we now have a better idea of what the real value is. Hence, we aggressively change the value down.

At the same time we make the partition_allocator memory limit check memory group aware. It will no longer compare `topic_memory_per_partition` against the total memory as if partitions could use all that memory. Instead it now compares it against the memory reserved for partitions via the memory groups. We use some heuristics (which are better explained in the code comment) to try to guard against cases where we would make partition density worse.

The new defaults are:
- topic_memory_per_partition: 200KiB (this already assumes some to-be-merged optimizations in the seastar metrics stack). It's still fairly conservative. Probably more like 150KiB.
- topic_partitions_max_memory_allocation_share: 10% - together with the above this effectively gives us twice the partition density compared to the old calculation using 4MiB for TMPP.
Adds a log at startup printing all the final memory group allocations to the different subsystems. Logs like:

```
... [shard 0:main] main - Per shard memory group allocations: total memory: 4.000GiB, total memory minus pre-share reservations: 3.475GiB, chunk cache: 627.953MiB, kafka: 1.226GiB, rpc: 837.271MiB, recovery: 418.635MiB, tiered storage: 418.635MiB, data transforms: 0.000bytes, compaction: 128.000MiB, datalake: 0.000bytes, partitions: 409.600MiB
```