Make partitions memory group aware #24472

Open

StephanDollberg wants to merge 5 commits into dev from stephan/partition-memory-groups

Conversation

@StephanDollberg (Member) commented Dec 6, 2024

Previously, topic_memory_per_partition was more of a rough guess, and its
value was far too large. This was partly a result of not having a better
estimate, but also because, obviously, not all memory can be used by
partitions.

After some analysis via metrics and the memory profiler, we now have a
better idea of the real value. Hence, we aggressively lower the default.

At the same time, we make the partition_allocator memory limit check
memory group aware. It no longer compares topic_memory_per_partition
against total memory as if partitions could use all of that memory.
Instead, it compares it against the memory reserved for partitions via
the memory groups.

We use some heuristics (explained in more detail in the code comments)
to guard against cases where we would make partition density worse.

The new defaults are:

  • topic_memory_per_partition: 200KiB (this already assumes some to-be-merged
    optimizations in the Seastar metrics stack). It's still fairly
    conservative; the real value is probably closer to 150KiB.
  • topic_partitions_max_memory_allocation_share: 10%. Together with the
    above, this effectively gives us twice the partition density of the old
    calculation with its 4MiB TMPP (see the sketch below).
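
As a rough illustration of the math (a minimal sketch; the constant and function names are hypothetical, not the PR's actual code):

```cpp
#include <cstddef>
#include <cstdio>

constexpr size_t KiB = 1024;
constexpr size_t MiB = 1024 * KiB;
constexpr size_t GiB = 1024 * MiB;

// New defaults from this PR.
constexpr size_t topic_memory_per_partition = 200 * KiB;
constexpr size_t partitions_allocation_share_pct = 10;

// Memory-group-aware limit: partitions are checked against their
// reservation, not against total memory.
size_t max_partition_replicas(size_t total_memory) {
    size_t reserved = total_memory * partitions_allocation_share_pct / 100;
    return reserved / topic_memory_per_partition;
}

int main() {
    // 16GiB node: 1.6GiB reserved / 200KiB per partition ~= 8388 replicas,
    // roughly 2x the old limit of 16GiB / 4MiB = 4096.
    std::printf("%zu\n", max_partition_replicas(16 * GiB));
    return 0;
}
```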

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Features

  • TBD

@StephanDollberg StephanDollberg force-pushed the stephan/partition-memory-groups branch 3 times, most recently from 4564975 to 1602ba6 Compare December 6, 2024 22:22
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Dec 6, 2024
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Dec 6, 2024
@StephanDollberg StephanDollberg force-pushed the stephan/partition-memory-groups branch from 1602ba6 to 419c29b Compare December 9, 2024 14:16
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Dec 9, 2024
@StephanDollberg StephanDollberg changed the title Stephan/partition memory groups Make partitions memory group aware Dec 10, 2024
@StephanDollberg StephanDollberg marked this pull request as ready for review December 16, 2024 09:24
@StephanDollberg StephanDollberg requested a review from a team as a code owner December 16, 2024 09:24
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Dec 16, 2024
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Dec 16, 2024
, topic_partitions_max_memory_allocation_share(
*this,
"topic_partitions_max_memory_allocation_share",
"foo",
Contributor:

placeholder? :D

Member Author (@StephanDollberg):

Yes, thanks. Fixed.

@StephanDollberg StephanDollberg force-pushed the stephan/partition-memory-groups branch from 419c29b to 2262c91 Compare December 16, 2024 15:59
, topic_partitions_memory_allocation_percent(
*this,
"topic_partitions_memory_allocation_percent",
"Percentage of total memory that is being reserved for topic partitions.",
Contributor:

Suggested change
"Percentage of total memory that is being reserved for topic partitions.",
"Percentage of total memory to reserve for topic partitions.",

@vbotbuildovich (Collaborator) commented Dec 16, 2024

Retry command for Build#59813

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

@vbotbuildovich (Collaborator) commented Dec 16, 2024

CI test results

test results on build#59813

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS | ducktape | https://buildkite.com/redpanda/redpanda/builds/59813#0193d07b-7144-4975-a69e-f771323494df | FAIL | 0/6 |
| rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True | ducktape | https://buildkite.com/redpanda/redpanda/builds/59813#0193d08e-3cfd-4271-a0b1-b8be6c886960 | FAIL | 0/1 |

test results on build#59905

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59905#0193d968-7f3b-410b-a73d-f99984280f13 | FLAKY | 1/2 |
| rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS | ducktape | https://buildkite.com/redpanda/redpanda/builds/59905#0193d9ae-342d-4f79-a509-5ad743cfc794 | FAIL | 0/6 |
| rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/59905#0193d9a9-d3b3-4212-9ce5-0319f44e90fd | FLAKY | 4/6 |
| rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/59905#0193d9ae-342f-4b4c-89af-501b30c977b7 | FAIL | 0/6 |

test results on build#59961

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS | ducktape | https://buildkite.com/redpanda/redpanda/builds/59961#0193dfd7-5068-430e-85d0-afde68da028d | FAIL | 0/6 |

@@ -31,6 +31,12 @@ struct compaction_memory_reservation {
double max_limit_pct{100.0};
};

struct partitions_memory_reservation {
size_t max_limit_pct = 0;
Member:

Why is this set to 0, then overwritten after construction in memory_groups()? Can't we just set it at construction with a designated initializer?

Member Author (@StephanDollberg):

No particular reason. We can change it.
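
For reference, the designated-initializer approach being suggested would look roughly like this (C++20; a minimal sketch with an assumed initialization site):

```cpp
#include <cstddef>

struct partitions_memory_reservation {
    size_t max_limit_pct = 0;
};

// Instead of default-constructing and assigning the field afterwards
// in memory_groups(), set it at construction:
auto reservation = partitions_memory_reservation{.max_limit_pct = 10};
```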

bool guess_is_memory_group_aware_memory_limit() {
// Here we are trying to guess whether `topic_memory_per_partition` is
// memory group aware. Originally the property was more like a guess of how
// much memory each partition would statically use. It was first set to 10
Member:

Pretty sure it was 1 MiB? We raised it, not lowered it IIRC.

Member Author (@StephanDollberg):

Ah, now that you say it, that rings a bell. I misremembered.

@@ -81,6 +81,92 @@ allocation_constraints partition_allocator::default_constraints() {
return req;
}

bool guess_is_memory_group_aware_memory_limit() {
// Here we are trying to guess whether `topic_memory_per_partition` is
// memory group aware. Originally the property was more like a guess of how
Member:

I don't think that's true. It was an empirical measurement to find an X where 8 GB / X gave a number of partitions that worked on 8 GB instances and 4 GB / X gave a number that worked on 4 GB instances, based on testing on those two instance types (which only really happened incidentally because Arm had some 4 GB instance types we wanted to try).

It was never a guess at the static memory usage of a partition: it was entirely driven by "OOM or not" during tests, so it also included dynamic memory use which "should" be accounted for in other places (but that is not always the case).

Just so the history makes some sense (this all makes more sense given that it was 1 MB, not 10 MB): the problem was that on 4 GB hosts you could not reach the partitions_per_shard value before OOMing, so this memory setting needed to be raised to lower the limit in that case.

Member Author (@StephanDollberg):

changed

// Note that if we get this guess wrong that isn't immediately fatal. This
// only becomes active when creating new topics or modifying existing
// topics. At that point users can increase the memory reservation if
// needed.
Member:

love the very comprehensive comment here

// than 10x smaller. We assume it's unlikely someone would have changed the
// value to be that much smaller and hence guess that all the values larger
// than 2 times the new default are old values. Equally in the new world
// it's unlikely anybody would increase the value (at all really).
Member:

I guess the most likely people are ourselves, if we find out that 100K or whatever is not enough, i.e., there is some other significant source of memory usage we didn't account for. If we need to go to 300K, say, we can't, because we'd be over this threshold and it would have the opposite of the expected effect. However, in this case we could just tweak the shares for a similar effect, e.g., reduce shares to 33% of their existing value (this is not exactly the same, since in theory it also gives that memory away to other subsystems, but in practice, based on how this works, it seems similar).

Member Author (@StephanDollberg):

I mean, we can always raise the default, really, as the heuristic is only needed if the value is overridden (at which point we can also reevaluate the heuristic).

Let's say we had to change it to 400K from the current 200K: I feel like the heuristic as-is would still be useful, as I don't think anyone would have put it below 1MB (and the caveat that nothing immediately breaks still applies).
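
A minimal sketch of the heuristic under discussion (threshold and names taken from the thread; the actual implementation in the PR differs in detail):

```cpp
#include <cstddef>
#include <optional>

constexpr size_t KiB = 1024;
// The first memory-group-aware default, deliberately frozen even if the
// live default changes later.
constexpr size_t ORIGINAL_MEMORY_GROUP_AWARE_TMPP = 200 * KiB;

bool guess_is_memory_group_aware_memory_limit(
  std::optional<size_t> user_override) {
    // Unset means the (memory-group-aware) default applies.
    if (!user_override) {
        return true;
    }
    // Overrides above 2x the original memory-group-aware default are
    // assumed to predate memory groups (old values were 1MiB and up).
    return *user_override <= 2 * ORIGINAL_MEMORY_GROUP_AWARE_TMPP;
}
```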

clusterlog.warn,
"Refusing to create {} new partitions as total partition count "
"{} "
"would exceed memory limit {}",
Member:

Please include more components of the math in the error message, e.g., memory_limit and other values, so we can reconstruct what happened without knowing the config values.

Member:

(yes, I know this is c/p from elsewhere but I think this is even more important now)

Member Author (@StephanDollberg):

Added a few more. Let me know if you think anything is missing.
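
The enriched message presumably ends up looking something like this (a hedged sketch; the variable names are illustrative, not the PR's actual code):

```cpp
vlog(
  clusterlog.warn,
  "Refusing to create {} new partitions: total partition count {} would "
  "exceed the memory limit. topic_memory_per_partition: {}, reserved "
  "partition memory: {}, resulting max partition count: {}",
  new_partitions_requested,
  proposed_total_partition_count,
  memory_per_partition,
  reserved_partition_memory,
  max_partition_count);
```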

@travisdowns (Member) left a comment

....


// had overridden the default when it was non-memory group aware. Hence, this
// value should NOT be changed and stay the same even if the (above) default is
// changed.
inline constexpr size_t ORIGINAL_MEMORY_GROUP_AWARE_TMPP = 200_KiB;
Member:

Well if we have to raise DEFAULT_TOPIC_MEMORY_PER_PARTITION 2x or more we won't be able to really adhere to this, right?

Member Author (@StephanDollberg):

Yes I guess (see the other comment).

We could also remove this bit of logic; I am not sure whether it is overdoing it.

@@ -34,7 +34,6 @@ class partition_allocator {
partition_allocator(
ss::sharded<members_table>&,
ss::sharded<features::feature_table>&,
config::binding<std::optional<size_t>> memory_per_partition,
Member:

why was this binding removed? we just hardcode access to the config now?

Member Author (@StephanDollberg):

Yes; the problem with bindings is that you can't access a property's metadata (is_overriden, etc.).
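
To illustrate the difference (a sketch assuming the config API shape referenced in the thread; is_overriden is the method named above):

```cpp
// A binding only exposes the current value:
//   config::binding<std::optional<size_t>> b = ...;
//   std::optional<size_t> v = b();
// Accessing the property directly also exposes its metadata:
auto& prop = config::shard_local_cfg().topic_memory_per_partition;
std::optional<size_t> value = prop();
bool overridden = prop.is_overriden(); // not available via a binding
```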

@@ -149,6 +148,12 @@ class partition_allocator {
const uint64_t new_partitions_replicas_requested,
const model::topic_namespace& topic) const;

// sub-routine of the above, checks available memory
std::error_code check_memory_limits(
const uint64_t new_partitions_replicas_requested,
Member:

dangling const here or intentional?

Member Author (@StephanDollberg):

removed

@@ -0,0 +1,32 @@
/*
Member:

Let's put an INFO-level log line at startup that exposes key info about the partition memory calculation, etc.? Like: you have X memory and Y partition-memory setting, and Z shares and W total shares, and parts_per_shard=V, so you get this many partitions total.

Member Author (@StephanDollberg):

Sure. Did you put this comment into this file intentionally?

Bit confused: you mention "Z shares and W total shares", but the partition side doesn't use any shares, so I'm wondering whether we should print only the partition-relevant info (maybe in the partition allocator?) or also all the other memory group info (including shares for the other subsystems).

Member:

@StephanDollberg - I intentionally put it in an (essentially arbitrary) file because comments in the top-level PR seem to get lost? They are treated differently from review comments, don't have threaded conversations, etc. Maybe I could just adopt a convention of using the first line of the first file, dunno.

Member:

> Bit confused: you mention "Z shares and W total shares"

Partitions have 10 shares in memory groups; that's what I was referring to. So my WXYZ comment was really about memory groups: we should print all the "variable info" from memory groups so we know exactly how the memory was split.

We should also have good logging about partition allocation decisions (and failure reasons), but I think we already have that?

Member Author (@StephanDollberg):

> Partitions have 10 shares in memory groups; that's what I was referring to. So my WXYZ comment was really about memory groups: we should print all the "variable info" from memory groups so we know exactly how the memory was split.

Yeah, ok. I was just confused because the partitions one doesn't use "shares" but a percentage reservation.

In any case I have pushed a commit which prints all the final memory group allocations on startup.

Looks like:

```
... [shard 0:main] main - Per shard memory group allocations: total memory: 4.000GiB, total memory minus pre-share reservations: 3.475GiB, chunk cache: 627.953MiB, kafka: 1.226GiB, rpc: 837.271MiB, recovery: 418.635MiB, tiered storage: 418.635MiB, data transforms: 0.000bytes, compaction: 128.000MiB, datalake: 0.000bytes, partitions: 409.600MiB
```

> We should also have good logging about partition allocation decisions (and failure reasons), but I think we already have that?

Yes, the updated failure warning should have all the details now.

@travisdowns (Member) left a comment

A couple of changes suggested.

That function is getting kinda big. Move the memory checking part out in
preparation for some changes to come.

It's not being watched, so no need to use a binding. Also, bindings don't
allow using any of the advanced property methods (is_overriden etc.).

Static partition memory use is currently unaccounted for in memory
groups.

This patch introduces a reserve to account for partition memory usage.

We don't use a share like some of the other groups, but rather an upfront
percentage. This is because we don't want the partitions' share to
shrink once we add more groups.
@StephanDollberg StephanDollberg force-pushed the stephan/partition-memory-groups branch from 2262c91 to f820490 Compare December 18, 2024 10:52
@vbotbuildovich (Collaborator) commented Dec 18, 2024

Retry command for Build#59905

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements@{"cloud_storage_type":1}
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

Adds a startup log printing all the final memory group allocations to
the different subsystems.

Logs like:

```
... [shard 0:main] main - Per shard memory group allocations: total memory: 4.000GiB, total memory minus pre-share reservations: 3.475GiB, chunk cache: 627.953MiB, kafka: 1.226GiB, rpc: 837.271MiB, recovery: 418.635MiB, tiered storage: 418.635MiB, data transforms: 0.000bytes, compaction: 128.000MiB, datalake: 0.000bytes, partitions: 409.600MiB
```
@vbotbuildovich (Collaborator)

Retry command for Build#59961

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}
