Make partitions memory group aware #24472

Open

StephanDollberg wants to merge 5 commits into dev from stephan/partition-memory-groups

Conversation

@StephanDollberg (Member) commented Dec 6, 2024

Previously, topic_memory_per_partition was more of a rough guess, and its
value was far too large. This was partly a result of not having a better
estimate, but also because, obviously, not all memory can be used by
partitions.

After some analysis via metrics and the memory profiler, we now have a
better idea of the real value. Hence, we aggressively lower the default.

At the same time, we make the partition_allocator memory limit check
memory group aware. It no longer compares topic_memory_per_partition
against total memory as if partitions could use all of that memory.
Instead, it compares it against the memory reserved for partitions via
the memory groups.

We use some heuristics (explained in more detail in the code comments)
to guard against cases where we would make partition density worse.

The new defaults are:

  • topic_memory_per_partition: 200KiB (this already assumes some to-be-merged
    optimizations in the Seastar metrics stack). It's still fairly
    conservative; the real value is probably closer to 150KiB.
  • topic_partitions_max_memory_allocation_share: 10%. Together with the
    above, this effectively gives us twice the partition density of the old
    calculation with its 4MiB TMPP (see the sketch below).
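
As a rough illustration of the math (a minimal sketch; the constant and function names are hypothetical, not the PR's actual code):

```cpp
#include <cstddef>
#include <cstdio>

constexpr size_t KiB = 1024;
constexpr size_t MiB = 1024 * KiB;
constexpr size_t GiB = 1024 * MiB;

// New defaults from this PR.
constexpr size_t topic_memory_per_partition = 200 * KiB;
constexpr size_t partitions_allocation_share_pct = 10;

// Memory-group-aware limit: partitions are checked against their
// reservation, not against total memory.
size_t max_partition_replicas(size_t total_memory) {
    size_t reserved = total_memory * partitions_allocation_share_pct / 100;
    return reserved / topic_memory_per_partition;
}

int main() {
    // 16GiB node: 1.6GiB reserved / 200KiB per partition ~= 8388 replicas,
    // roughly 2x the old limit of 16GiB / 4MiB = 4096.
    std::printf("%zu\n", max_partition_replicas(16 * GiB));
    return 0;
}
```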

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Features

  • TBD

@StephanDollberg StephanDollberg force-pushed the stephan/partition-memory-groups branch 3 times, most recently from 4564975 to 1602ba6 Compare December 6, 2024 22:22
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Dec 6, 2024
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Dec 6, 2024
@StephanDollberg StephanDollberg force-pushed the stephan/partition-memory-groups branch from 1602ba6 to 419c29b Compare December 9, 2024 14:16
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Dec 9, 2024
@StephanDollberg StephanDollberg changed the title Stephan/partition memory groups Make partitions memory group aware Dec 10, 2024
@StephanDollberg StephanDollberg marked this pull request as ready for review December 16, 2024 09:24
@StephanDollberg StephanDollberg requested a review from a team as a code owner December 16, 2024 09:24
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Dec 16, 2024
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Dec 16, 2024
, topic_partitions_max_memory_allocation_share(
*this,
"topic_partitions_max_memory_allocation_share",
"foo",
Contributor:

placeholder? :D

Member Author (@StephanDollberg):

Yes, thanks. Fixed.

@StephanDollberg StephanDollberg force-pushed the stephan/partition-memory-groups branch from 419c29b to 2262c91 Compare December 16, 2024 15:59
, topic_partitions_memory_allocation_percent(
*this,
"topic_partitions_memory_allocation_percent",
"Percentage of total memory that is being reserved for topic partitions.",
Contributor:

Suggested change
"Percentage of total memory that is being reserved for topic partitions.",
"Percentage of total memory to reserve for topic partitions.",

@vbotbuildovich (Collaborator) commented Dec 16, 2024

Retry command for Build#59813

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

@vbotbuildovich (Collaborator) commented Dec 16, 2024

CI test results

test results on build#59813

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS | ducktape | https://buildkite.com/redpanda/redpanda/builds/59813#0193d07b-7144-4975-a69e-f771323494df | FAIL | 0/6 |
| rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True | ducktape | https://buildkite.com/redpanda/redpanda/builds/59813#0193d08e-3cfd-4271-a0b1-b8be6c886960 | FAIL | 0/1 |

test results on build#59905

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59905#0193d968-7f3b-410b-a73d-f99984280f13 | FLAKY | 1/2 |
| rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS | ducktape | https://buildkite.com/redpanda/redpanda/builds/59905#0193d9ae-342d-4f79-a509-5ad743cfc794 | FAIL | 0/6 |
| rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/59905#0193d9a9-d3b3-4212-9ce5-0319f44e90fd | FLAKY | 4/6 |
| rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/59905#0193d9ae-342f-4b4c-89af-501b30c977b7 | FAIL | 0/6 |

test results on build#59961

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS | ducktape | https://buildkite.com/redpanda/redpanda/builds/59961#0193dfd7-5068-430e-85d0-afde68da028d | FAIL | 0/6 |

@@ -31,6 +31,12 @@ struct compaction_memory_reservation {
double max_limit_pct{100.0};
};

struct partitions_memory_reservation {
size_t max_limit_pct = 0;
Member:

Why is this set to 0, then overwritten after construction in memory_groups()? Can't we just set it at construction with a designated initializer?

Member Author (@StephanDollberg):

No particular reason. We can change it.
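
For reference, the designated-initializer approach being suggested would look roughly like this (C++20; a minimal sketch with an assumed initialization site):

```cpp
#include <cstddef>

struct partitions_memory_reservation {
    size_t max_limit_pct = 0;
};

// Instead of default-constructing and assigning the field afterwards
// in memory_groups(), set it at construction:
auto reservation = partitions_memory_reservation{.max_limit_pct = 10};
```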

bool guess_is_memory_group_aware_memory_limit() {
// Here we are trying to guess whether `topic_memory_per_partition` is
// memory group aware. Originally the property was more like a guess of how
// much memory each partition would statically use. It was first set to 10
Member:

Pretty sure it was 1 MiB? We raised it, not lowered it IIRC.

Member Author (@StephanDollberg):

Ah, now that you say it, that rings a bell. I misremembered.

@@ -81,6 +81,92 @@ allocation_constraints partition_allocator::default_constraints() {
return req;
}

bool guess_is_memory_group_aware_memory_limit() {
// Here we are trying to guess whether `topic_memory_per_partition` is
// memory group aware. Originally the property was more like a guess of how
Member:

I don't think that's true. It was an empirical measurement to find an X where 8 GB / X gave a number of partitions that worked on 8 GB instances and 4 GB / X gave a number that worked on 4 GB instances, based on testing on those two instance types (which only really happened incidentally because Arm had some 4 GB instance types we wanted to try).

It was never a guess at the static memory usage of a partition: it was entirely driven by "OOM or not" during tests, so it also included dynamic memory use which "should" be accounted for in other places (but that is not always the case).

Just so the history makes some sense (this all makes more sense given that it was 1 MB, not 10 MB): the problem was that on 4 GB hosts you could not reach the partitions_per_shard value before OOMing, so this memory setting needed to be raised to lower the limit in that case.

Member Author (@StephanDollberg):

changed

// Note that if we get this guess wrong that isn't immediately fatal. This
// only becomes active when creating new topics or modifying existing
// topics. At that point users can increase the memory reservation if
// needed.
Member:

love the very comprehensive comment here

// than 10x smaller. We assume it's unlikely someone would have changed the
// value to be that much smaller and hence guess that all the values larger
// than 2 times the new default are old values. Equally in the new world
// it's unlikely anybody would increase the value (at all really).
Member:

I guess the most likely people are ourselves, if we find out that 100K or whatever is not enough, i.e., there is some other significant source of memory usage we didn't account for. If we need to go to 300K, say, we can't, because we'd be over this threshold and it would have the opposite of the expected effect. However, in this case we could just tweak the shares for a similar effect, e.g., reduce shares to 33% of their existing value (this is not exactly the same, since in theory it also gives that memory away to other subsystems, but in practice, based on how this works, it seems similar).

Member Author (@StephanDollberg):

I mean, we can always raise the default, really, as the heuristic is only needed if the value is overridden (at which point we can also reevaluate the heuristic).

Let's say we had to change it to 400K from the current 200K: I feel like the heuristic as-is would still be useful, as I don't think anyone would have put it below 1MB (and the caveat that nothing immediately breaks still applies).
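
A minimal sketch of the heuristic under discussion (threshold and names taken from the thread; the actual implementation in the PR differs in detail):

```cpp
#include <cstddef>
#include <optional>

constexpr size_t KiB = 1024;
// The first memory-group-aware default, deliberately frozen even if the
// live default changes later.
constexpr size_t ORIGINAL_MEMORY_GROUP_AWARE_TMPP = 200 * KiB;

bool guess_is_memory_group_aware_memory_limit(
  std::optional<size_t> user_override) {
    // Unset means the (memory-group-aware) default applies.
    if (!user_override) {
        return true;
    }
    // Overrides above 2x the original memory-group-aware default are
    // assumed to predate memory groups (old values were 1MiB and up).
    return *user_override <= 2 * ORIGINAL_MEMORY_GROUP_AWARE_TMPP;
}
```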

clusterlog.warn,
"Refusing to create {} new partitions as total partition count "
"{} "
"would exceed memory limit {}",
Member:

Please include more components of the math in the error message, e.g., memory_limit and other values, so we can reconstruct what happened without knowing the config values.

Member:

(yes, I know this is c/p from elsewhere but I think this is even more important now)

Member Author (@StephanDollberg):

Added a few more. Let me know if you think anything is missing.
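
The enriched message presumably ends up looking something like this (a hedged sketch; the variable names are illustrative, not the PR's actual code):

```cpp
vlog(
  clusterlog.warn,
  "Refusing to create {} new partitions: total partition count {} would "
  "exceed the memory limit. topic_memory_per_partition: {}, reserved "
  "partition memory: {}, resulting max partition count: {}",
  new_partitions_requested,
  proposed_total_partition_count,
  memory_per_partition,
  reserved_partition_memory,
  max_partition_count);
```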

@travisdowns (Member) left a comment

....


// had overridden the default when it was non-memory group aware. Hence, this
// value should NOT be changed and stay the same even if the (above) default is
// changed.
inline constexpr size_t ORIGINAL_MEMORY_GROUP_AWARE_TMPP = 200_KiB;
Member:

Well if we have to raise DEFAULT_TOPIC_MEMORY_PER_PARTITION 2x or more we won't be able to really adhere to this, right?

Member Author (@StephanDollberg):

Yes I guess (see the other comment).

We could also remove this bit of logic; I am not sure whether it is overdoing it.

@@ -34,7 +34,6 @@ class partition_allocator {
partition_allocator(
ss::sharded<members_table>&,
ss::sharded<features::feature_table>&,
config::binding<std::optional<size_t>> memory_per_partition,
Member:

why was this binding removed? we just hardcode access to the config now?

Member Author (@StephanDollberg):

Yes; the problem with bindings is that you can't access a property's metadata (is_overriden, etc.).
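
To illustrate the difference (a sketch assuming the config API shape referenced in the thread; is_overriden is the method named above):

```cpp
// A binding only exposes the current value:
//   config::binding<std::optional<size_t>> b = ...;
//   std::optional<size_t> v = b();
// Accessing the property directly also exposes its metadata:
auto& prop = config::shard_local_cfg().topic_memory_per_partition;
std::optional<size_t> value = prop();
bool overridden = prop.is_overriden(); // not available via a binding
```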

@@ -149,6 +148,12 @@ class partition_allocator {
const uint64_t new_partitions_replicas_requested,
const model::topic_namespace& topic) const;

// sub-routine of the above, checks available memory
std::error_code check_memory_limits(
const uint64_t new_partitions_replicas_requested,
Member:

dangling const here or intentional?

Member Author (@StephanDollberg):

removed

@@ -0,0 +1,32 @@
/*
Member:

Let's put an INFO-level log line at startup that exposes key info about the partition memory calculation, etc.? Like: you have X memory and Y partition-memory setting, and Z shares and W total shares, and parts_per_shard=V, so you get this many partitions total.

Member Author (@StephanDollberg):

Sure. Did you put this comment into this file intentionally?

Bit confused: you mention "Z shares and W total shares", but the partition side doesn't use any shares, so I'm wondering whether we should print only the partition-relevant info (maybe in the partition allocator?) or also all the other memory group info (including shares for the other subsystems).

Member:

@StephanDollberg - I intentionally put it in an (essentially arbitrary) file because comments in the top-level PR seem to get lost? They are treated differently from review comments, don't have threaded conversations, etc. Maybe I could just adopt a convention of using the first line of the first file, dunno.

Member:

> Bit confused: you mention "Z shares and W total shares"

Partitions have 10 shares in memory groups; that's what I was referring to. So my WXYZ comment was really about memory groups: we should print all the "variable info" from memory groups so we know exactly how the memory was split.

We should also have good logging about partition allocation decisions (and failure reasons), but I think we already have that?

Member Author (@StephanDollberg):

> Partitions have 10 shares in memory groups; that's what I was referring to. So my WXYZ comment was really about memory groups: we should print all the "variable info" from memory groups so we know exactly how the memory was split.

Yeah, ok. I was just confused because the partitions one doesn't use "shares" but a percentage reservation.

In any case I have pushed a commit which prints all the final memory group allocations on startup.

Looks like:

```
... [shard 0:main] main - Per shard memory group allocations: total memory: 4.000GiB, total memory minus pre-share reservations: 3.475GiB, chunk cache: 627.953MiB, kafka: 1.226GiB, rpc: 837.271MiB, recovery: 418.635MiB, tiered storage: 418.635MiB, data transforms: 0.000bytes, compaction: 128.000MiB, datalake: 0.000bytes, partitions: 409.600MiB
```

> We should also have good logging about partition allocation decisions (and failure reasons), but I think we already have that?

Yes, the updated failure warning should have all the details now.

@travisdowns (Member) left a comment

A couple of changes suggested.

That function is getting kinda big. Move the memory checking part out in
preparation for some changes to come.

It's not being watched, so no need to use a binding. Also, bindings don't
allow using any of the advanced property methods (is_overriden etc.).

Static partition memory use is currently unaccounted for in memory
groups.

This patch introduces a reserve to account for partition memory usage.

We don't use a share like some of the other groups, but rather an upfront
percentage. This is because we don't want the partitions' share to
shrink once we add more groups.
@StephanDollberg StephanDollberg force-pushed the stephan/partition-memory-groups branch from 2262c91 to f820490 Compare December 18, 2024 10:52
@vbotbuildovich (Collaborator) commented Dec 18, 2024

Retry command for Build#59905

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements@{"cloud_storage_type":1}
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

Adds a startup log printing all the final memory group allocations to
the different subsystems.

Logs like:

```
... [shard 0:main] main - Per shard memory group allocations: total memory: 4.000GiB, total memory minus pre-share reservations: 3.475GiB, chunk cache: 627.953MiB, kafka: 1.226GiB, rpc: 837.271MiB, recovery: 418.635MiB, tiered storage: 418.635MiB, data transforms: 0.000bytes, compaction: 128.000MiB, datalake: 0.000bytes, partitions: 409.600MiB
```
@vbotbuildovich (Collaborator)

Retry command for Build#59961

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}
