[CORE-13298] Reimplement consumer_lag from health_report #27716
Conversation
Signed-off-by: Ben Pope <[email protected]>
Signed-off-by: Ben Pope <[email protected]>
This reduces the number of requests to the controller from one per-shard to one per broker. Signed-off-by: Ben Pope <[email protected]>
Pull Request Overview
The purpose of this PR is to reimplement the consumer group lag metrics functionality to use the health monitor system instead of a custom RPC service. This optimization reduces the number of requests from one per group coordinator (broker-shards) to one per broker by leveraging health report data that contains high watermark (HWM) information.
Key changes:
- Remove the entire consumer group lag metrics RPC infrastructure (frontend, service, types)
- Integrate lag collection directly into group_manager using health monitor data
- Update test timing to account for both lag collection and health monitor intervals
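To make the arithmetic behind the new approach concrete, here is a minimal, self-contained sketch. It is not the PR's code: the names `partition_id`, `offset_t`, `hwm_map`, `committed_map`, and `compute_lag` are hypothetical stand-ins for Redpanda's internal types. It shows lag computed from the high watermarks carried in a health report and the committed offsets the group coordinator already holds, clamped at zero because the reported HWM can be slightly older than the live offsets:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>

// Hypothetical stand-ins for illustration only.
using partition_id = int32_t;
using offset_t = int64_t;

// High watermarks per partition, as carried in a broker's health report.
using hwm_map = std::map<partition_id, offset_t>;
// Committed offsets per partition, read live by the group coordinator.
using committed_map = std::map<partition_id, offset_t>;

// Consumer lag = HWM - committed offset, clamped at zero because the HWM
// taken from the health report may trail the live committed offsets.
std::map<partition_id, offset_t>
compute_lag(const hwm_map& hwms, const committed_map& committed) {
    std::map<partition_id, offset_t> lag;
    for (const auto& [partition, hwm] : hwms) {
        if (auto it = committed.find(partition); it != committed.end()) {
            lag[partition] = std::max<offset_t>(hwm - it->second, 0);
        }
    }
    return lag;
}
```

The real implementation operates on Redpanda's ntp and health-report types with proper error handling; treat this only as an illustration of where the HWM and the committed offsets come from.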
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/v/redpanda/application.h | Remove consumer group lag metrics frontend member variable |
| src/v/redpanda/application.cc | Remove construction and initialization of lag metrics frontend and service |
| src/v/kafka/server/group_manager.h | Replace lag metrics frontend dependency with health monitor frontend |
| src/v/kafka/server/group_manager.cc | Reimplement lag collection using health monitor reports instead of RPC calls |
| src/v/kafka/server/fwd.h | Remove forward declarations for removed classes |
| src/v/kafka/server/consumer_group_lag_metrics_* | Complete removal of custom RPC infrastructure files |
| src/v/kafka/server/BUILD | Remove build rules for deleted RPC service |
| src/v/config/configuration.cc | Update configuration documentation to mention health monitor dependency |
| tests/rptest/tests/consumer_group_test.py | Update test timing and configuration for new health monitor integration |
```python
{
    "enable_consumer_group_metrics": ["group", "partition", "consumer_lag"],
    "consumer_group_lag_collection_interval_sec": lag_collection_interval,
    "health_monitor_max_metadata_age": lag_collection_interval,
```
[nitpick] The health_monitor_max_metadata_age should be set independently of lag_collection_interval. These are different concepts - one controls metadata freshness and the other controls lag collection frequency. Consider using a separate variable or explaining why they should be equal.
CI test results on build #72887
looks great
```cpp
for (auto& [node_id, filter] : requests) {
    auto report_r = co_await _hm_frontend.local().get_cluster_health(
      std::move(filter),
      cluster::force_refresh::no,
      model::timeout_clock::now() + _lag_collection_interval());
    if (report_r) {
        responses.emplace(node_id, std::move(report_r).assume_value());
    } else {
        vlog(
          klog.warn,
          "group_manager::collect_consumer_lag_metrics: "
          "failed to get cluster health report: {}",
          report_r.error());
    }
}
```
I was wondering if we need to issue these in parallel / with some parallelism, but I think we don't need that because this is not issuing requests directly but rather querying cached state and, if stale, issues a full refresh. Is my understanding correct there?
Yeah, it does seem to do a full refresh from all nodes, so subsequent requests in the loop should be served from cached state.
I guess we could change the api of the filters to be more expressive. Or I could request the full ntp set from all nodes and then filter by leader.
```cpp
runtime_services.push_back(
  std::make_unique<kafka::consumer_group_lag_metrics_service>(
    sched_groups.cluster_sg(),
    smp_service_groups.cluster_smp_sg(),
    std::ref(_consumer_group_lag_metrics_frontend)));
```
Do we need to keep this code around for 1 major version to avoid the metrics disappearing during an upgrade?
I think if we remove this now, old nodes would still try to hit this endpoint during an upgrade but would get back `rpc::errc::method_not_found`, and the metrics would be broken during the upgrade.
I guess it's not too onerous.
```cpp
for (auto shard_id : std::views::iota(ss::shard_id(0), ss::smp::count)) {
    co_await container().invoke_on(shard_id, collect_requests);
}
```
nit: `co_await container().invoke_on_all(collect_requests);`
> nit: `co_await container().invoke_on_all(collect_requests);`

`invoke_on_all` is parallel and `collect_requests` mutates shared state, so doing it sequentially was intentional. I initially did it with a `map_reduce`, but the reduce is a bit ugly. I can take another stab at it.
Not pretty
```cpp
auto r = co_await container().map_reduce0(
  collect_requests, requests_t{}, [](requests_t acc, requests_t val) {
      for (auto& [leader_id, filter] : val) {
          auto& acc_report = acc
                               .try_emplace(
                                 leader_id,
                                 cluster::cluster_report_filter{
                                   .nodes = {leader_id}})
                               .first->second;
          for (auto& [ns, topics] :
               filter.node_report_filter.ntp_filters.namespaces) {
              auto& acc_ns
                = acc_report.node_report_filter.ntp_filters.namespaces[ns];
              for (auto& [topic, parts] : topics) {
                  auto& acc_topic = acc_ns[topic];
                  acc_topic.insert(parts.begin(), parts.end());
              }
          }
      }
      return acc;
  });
```
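One possible way to tidy that reduce (purely a sketch: `ntp_filter`, `requests_t`, and `merge_requests` below are simplified, hypothetical stand-ins, not the real `cluster::cluster_report_filter`) is to factor the nested merge into a named helper so the `map_reduce0` call site stays short:

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical simplified model of a per-node ntp filter:
// namespace -> topic -> set of partition ids.
using ntp_filter = std::map<std::string, std::map<std::string, std::set<int>>>;
// node id -> filter to request from that node (the accumulated requests).
using requests_t = std::map<int, ntp_filter>;

// Merge one shard's per-node filters into the accumulator.
requests_t merge_requests(requests_t acc, const requests_t& shard_requests) {
    for (const auto& [node_id, filter] : shard_requests) {
        auto& acc_filter = acc[node_id];
        for (const auto& [ns, topics] : filter) {
            auto& acc_ns = acc_filter[ns];
            for (const auto& [topic, partitions] : topics) {
                acc_ns[topic].insert(partitions.begin(), partitions.end());
            }
        }
    }
    return acc;
}
```

With a helper like this, the reduce argument passed to `map_reduce0` could collapse to a thin lambda that forwards to the helper, which may read better than the inline version above.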
"Updates will not be more frequent than " | ||
"`health_monitor_max_metadata_age`.", |
`health_monitor_max_metadata_age` currently only limits how fresh the hwm is, but not how fresh the group offsets are, if I'm not mistaken. I'm wondering if we should update the docs to call that out or use `std::max(consumer_group_lag_collection_interval_sec, health_monitor_max_metadata_age)` as the collection interval.
The offsets are read live, whilst the HWM may be a little stale, so it's possible to have a negative lag, which is unfortunate.
Since I'm pushing `consumer_group_lag_collection_interval` into the build report, the data can be stale by `min(health_monitor_max_metadata_age, consumer_group_lag_collection_interval)`.
"How often to run the collection loop when enable_consumer_group_metrics " | ||
"contains consumer_lag", | ||
"contains consumer_lag." | ||
"Updates will not be more frequent than " | ||
"`health_monitor_max_metadata_age`.", |
"How often to run the collection loop when enable_consumer_group_metrics " | |
"contains consumer_lag", | |
"contains consumer_lag." | |
"Updates will not be more frequent than " | |
"`health_monitor_max_metadata_age`.", | |
"How often Redpanda runs the collection loop when `enable_consumer_group_metrics` is set to `consumer_lag`. Updates will not be more frequent than `health_monitor_max_metadata_age`." |
I hate that this does not preserve spaces sometimes :(
Reimplement lag metrics from health report
The health report now contains the HWM for each partition, which means we don't have to fetch them from every node; we can ask the controller instead.
Also optimise the number of requests from one per group coordinator (broker-shards) to one per broker, by gathering all required HWMs from all shards.
Backports Required
Release Notes