Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datalake/metrics: miscellaneous improvements to lag metrics #24568

Merged
merged 1 commit into from
Dec 14, 2024

Conversation

bharathv
Copy link
Contributor

@bharathv bharathv commented Dec 13, 2024

This is based on our experience debugging Brandon's perf setup.

  • Changes the metric reporting to report lag only on leaders. This makes it easy to monitor the metric using an aggregate across all replicas without having to worry about the current leader.

  • Fixed a bug where lag entry was not added to serde fields, adjusted the test coverage to catch this scenario, refactored the test slightly while I'm there.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • Fixes accounting of iceberg commit lag metric that can remain erroneously high in some cases even though the translation if fully caught up. Additionally the change ensures that only partition leaders emit lag metrics while followers emit 0 lag.

@bharathv
Copy link
Contributor Author

/dt

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 13, 2024

Retry command for Build#59736

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"cloud_storage_type":2,"enable_failures":false,"mixed_versions":false,"with_iceberg":true,"with_tiered_storage":false}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"cloud_storage_type":2,"enable_failures":true,"mixed_versions":false,"with_iceberg":true,"with_tiered_storage":false}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"cloud_storage_type":2,"enable_failures":false,"mixed_versions":false,"with_iceberg":true,"with_tiered_storage":true}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"cloud_storage_type":2,"enable_failures":true,"mixed_versions":false,"with_iceberg":true,"with_tiered_storage":true}
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":20}
tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 13, 2024

CI test results

test results on build#59736
test_id test_kind job_url test_status passed
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=20.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59736#0193c186-26fa-4381-a96d-3f474944f751 FAIL 0/6
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59736#0193c186-26f7-4897-ab06-b9d280a523b8 FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59736#0193c17a-d297-4d6c-aa15-83e75a4d9418 FLAKY 3/6
rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True ducktape https://buildkite.com/redpanda/redpanda/builds/59736#0193c186-26fa-4381-a96d-3f474944f751 FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=False.with_tiered_storage=False.with_iceberg=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59736#0193c186-26fa-4381-a96d-3f474944f751 FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=False.with_tiered_storage=True.with_iceberg=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59736#0193c186-26fa-4381-a96d-3f474944f751 FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=False.with_tiered_storage=False.with_iceberg=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59736#0193c186-26fa-4381-a96d-3f474944f751 FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=False.with_tiered_storage=True.with_iceberg=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59736#0193c186-26fa-4381-a96d-3f474944f751 FAIL 0/1
test results on build#59747
test_id test_kind job_url test_status passed
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=20.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59747#0193c269-6abc-4911-bdab-f1517aca740b FAIL 0/6
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59747#0193c269-6aba-4948-afa6-e470f8d852b8 FAIL 0/6
rptest.tests.datalake.simple_connect_test.RedpandaConnectIcebergTest.test_translating_avro_serialized_records.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59747#0193c27c-b710-4826-9128-c6f07a40a219 FLAKY 5/6
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=False.with_tiered_storage=False.with_iceberg=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59747#0193c269-6abc-4911-bdab-f1517aca740b FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=False.with_tiered_storage=True.with_iceberg=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59747#0193c269-6abc-4911-bdab-f1517aca740b FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=False.with_tiered_storage=False.with_iceberg=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59747#0193c269-6abc-4911-bdab-f1517aca740b FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=False.with_tiered_storage=True.with_iceberg=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59747#0193c269-6abc-4911-bdab-f1517aca740b FAIL 0/1
test results on build#59767
test_id test_kind job_url test_status passed
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59767#0193c63f-fc14-4c1e-9b1a-83564aa6dc6c FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59767#0193c63f-fc16-403d-9fc0-1cdb5da1acea FLAKY 1/6

@bharathv bharathv marked this pull request as ready for review December 13, 2024 22:26
@bharathv bharathv added this to the v24.3.x-next milestone Dec 13, 2024
andrwng
andrwng previously approved these changes Dec 13, 2024
Copy link
Contributor

@andrwng andrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but deserves a release note

rockwotj
rockwotj previously approved these changes Dec 14, 2024
@@ -168,14 +180,19 @@ void replicated_partition_probe::setup_internal_metrics(const model::ntp& ntp) {
{sm::shard_label, partition_label});

if (model::is_user_topic(_partition.ntp())) {
// Metrics are reported as follows
// -2 (default initialized state)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add this to the public description too for downstream systems?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 14, 2024

Retry command for Build#59747

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"cloud_storage_type":2,"enable_failures":false,"mixed_versions":false,"with_iceberg":true,"with_tiered_storage":false}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"cloud_storage_type":2,"enable_failures":false,"mixed_versions":false,"with_iceberg":true,"with_tiered_storage":true}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"cloud_storage_type":2,"enable_failures":true,"mixed_versions":false,"with_iceberg":true,"with_tiered_storage":false}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"cloud_storage_type":2,"enable_failures":true,"mixed_versions":false,"with_iceberg":true,"with_tiered_storage":true}
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":20}
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

This is based on our experience debugging Brandon's perf setup.

- Changes the metric reporting to report lag only on leaders. This makes
  it easy to monitor the metric using an aggregate across all replicas
  without having to worry about the current leader.

- Fixed a bug where lag entry was not added to serde fields, adjusted
  the test coverage to catch this scenario, refactored the test slightly
  while I'm there.
@bharathv bharathv dismissed stale reviews from rockwotj and andrwng via bef3254 December 14, 2024 16:21
@bharathv bharathv requested review from rockwotj and andrwng December 14, 2024 16:21
@vbotbuildovich
Copy link
Collaborator

Retry command for Build#59767

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

@piyushredpanda piyushredpanda merged commit 4343e5e into redpanda-data:dev Dec 14, 2024
15 of 18 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v24.3.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants