@andrwng andrwng commented Sep 24, 2025

Previously, the builder left behind empty partitions. After a bad object was removed from the builder, the call to replicated_metastore::add_objects() would examine the empty set of objects meant for a given metastore partition and still expect terms to be routed to that partition. This resulted in the following error:

ERROR 2025-09-24 06:14:38,875 [shard 0:main] cloud_topics - replicated_metastore.cc:320 - No term metadata routed to partition 1
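
To illustrate the failure mode described above, here is a minimal sketch of a builder that drops a partition entry once its last object is removed. The class `object_builder_sketch` and all of its members are hypothetical names made up for this example; they are not taken from the Redpanda source, and the real builder likely tracks more state (for example, term metadata).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the object builder: maps a metastore
// partition id to the ids of the objects queued for that partition.
// None of these names come from the actual Redpanda code.
class object_builder_sketch {
public:
    void add_object(int32_t partition, const std::string& object_id) {
        objects_by_partition_[partition].insert(object_id);
    }

    // Removes an object. If it was the last object for its partition,
    // the partition entry is erased as well, so later passes never see
    // a partition with an empty object set.
    // Returns true if the object was found and removed.
    bool remove_object(int32_t partition, const std::string& object_id) {
        auto it = objects_by_partition_.find(partition);
        if (it == objects_by_partition_.end()) {
            return false;
        }
        if (it->second.erase(object_id) == 0) {
            return false;
        }
        if (it->second.empty()) {
            objects_by_partition_.erase(it);
        }
        return true;
    }

    bool has_partition(int32_t partition) const {
        return objects_by_partition_.count(partition) > 0;
    }

    std::size_t partition_count() const {
        return objects_by_partition_.size();
    }

    // Invariant the cleanup maintains: no tracked partition is empty.
    void check_no_empty_partitions() const {
        for (const auto& entry : objects_by_partition_) {
            (void)entry; // avoids an unused warning when NDEBUG is set
            assert(!entry.second.empty());
        }
    }

private:
    std::map<int32_t, std::set<std::string>> objects_by_partition_;
};
```

With this shape, a consumer that iterates over the builder's partitions only ever visits partitions that still have objects, so a check along the lines of "terms must be routed to every partition present" is not tripped by a leftover empty entry.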

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • None

@Copilot Copilot AI review requested due to automatic review settings September 24, 2025 20:49
@andrwng andrwng review requested due to automatic review settings September 24, 2025 20:49

@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes a bug where removing the last object from a metastore partition would leave behind empty partition metadata in the builder. This caused failures when add_objects() encountered empty partitions that still expected term routing information.

Key changes:

  • Added logic to clean up empty partitions when the last object is removed from a builder
  • Added regression test to verify the fix

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/v/cloud_topics/level_one/metastore/replicated_metastore.cc | Adds partition cleanup logic when removing the last object leaves partition empty |
| src/v/cloud_topics/level_one/metastore/tests/replicated_metastore_test.cc | Adds regression test that reproduces and verifies fix for empty partition cleanup |

rockwotj previously approved these changes Sep 24, 2025

@rockwotj rockwotj left a comment

Nice catch!

dotnwat previously approved these changes Sep 24, 2025

@dotnwat dotnwat left a comment

thanks

@dotnwat dotnwat enabled auto-merge September 24, 2025 21:27
wdberkeley previously approved these changes Sep 24, 2025
@andrwng andrwng disabled auto-merge September 24, 2025 23:01
@andrwng andrwng dismissed stale reviews from wdberkeley, dotnwat, and rockwotj via 0b4e538 September 24, 2025 23:06
@andrwng andrwng force-pushed the ct-l1-remove-obj-empty branch from d2387f6 to 0b4e538 Compare September 24, 2025 23:06

andrwng commented Sep 24, 2025

Force pushed to rebase and pass the object builder by const ref

@vbotbuildovich

Retry command for Build#72906

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest.test_client_response_throttle_mechanism_applies_to_next_request


vbotbuildovich commented Sep 25, 2025

CI test results

test results on build#72902

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ClusterRateQuotaTest | test_client_response_and_produce_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72902#01997ebb-d0b9-4dc2-be13-5cb22686a0e8 | FLAKY | 13/21 | upstream reliability is '80.43087971274686'. current run reliability is '61.904761904761905'. drift is 18.52612 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_and_produce_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism_applies_to_next_request | null | integration | https://buildkite.com/redpanda/redpanda/builds/72902#01997ebb-d0bb-4685-9e0f-848262b07461 | FLAKY | 14/21 | upstream reliability is '83.75262054507337'. current run reliability is '66.66666666666666'. drift is 17.08595 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism_applies_to_next_request |

test results on build#72906

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ClusterRateQuotaTest | test_client_group_produce_rate_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72906#01997e8c-4956-4007-9c8a-606d6e05e440 | FLAKY | 14/21 | upstream reliability is '85.09389671361502'. current run reliability is '66.66666666666666'. drift is 18.42723 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_produce_rate_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_and_produce_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72906#01997e8b-a0fd-4b1d-aaa8-c8b4dc446c7d | FLAKY | 13/21 | upstream reliability is '80.43087971274686'. current run reliability is '61.904761904761905'. drift is 18.52612 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_and_produce_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_and_produce_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72906#01997e8c-4957-488e-af7c-43596ff7835d | FLAKY | 16/21 | upstream reliability is '92.84232365145229'. current run reliability is '76.19047619047619'. drift is 16.65185 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_and_produce_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72906#01997e8b-a0fe-486a-9984-9963332bcadb | FLAKY | 15/21 | upstream reliability is '92.63984298331698'. current run reliability is '71.42857142857143'. drift is 21.21127 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism_applies_to_next_request | null | integration | https://buildkite.com/redpanda/redpanda/builds/72906#01997e8b-a0ff-4c67-a5fb-07e22515d358 | FLAKY | 9/21 | upstream reliability is '93.99293286219081'. current run reliability is '42.857142857142854'. drift is 51.13579 and the allowed drift is set to 50. The test should FAIL | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism_applies_to_next_request |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism_applies_to_next_request | null | integration | https://buildkite.com/redpanda/redpanda/builds/72906#01997e8c-4958-4534-9e0c-6a065959eadc | FLAKY | 15/21 | upstream reliability is '93.98584905660378'. current run reliability is '71.42857142857143'. drift is 22.55728 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism_applies_to_next_request |
| DisablingPartitionsTest | test_disable | null | integration | https://buildkite.com/redpanda/redpanda/builds/72906#01997e8c-495a-4da9-b267-192d17179b49 | FLAKY | 11/21 | upstream reliability is '83.86194029850746'. current run reliability is '52.38095238095239'. drift is 31.48099 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DisablingPartitionsTest&test_method=test_disable |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/72906#01997e8b-a0fe-486a-9984-9963332bcadb | FLAKY | 15/21 | upstream reliability is '94.84126984126983'. current run reliability is '71.42857142857143'. drift is 23.4127 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |
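
The PASS/FAIL verdicts in the report appear to follow a simple rule: the current run reliability is the pass ratio as a percentage, the drift is the upstream reliability minus that value, and the test is expected to pass while the drift stays within the allowed drift. The sketch below reconstructs that arithmetic from the reported numbers only. The names `flaky_verdict` and `evaluate_flaky` are hypothetical, and whether the bot treats a drift exactly equal to the allowance as a pass is an assumption, since the report contains no such case.

```cpp
#include <cassert>

// Hypothetical reconstruction of the flaky-test check, inferred only
// from the numbers in the report above; the bot's real implementation
// is not shown on this page.
struct flaky_verdict {
    double current_reliability; // percentage of runs that passed
    double drift;               // upstream reliability minus current
    bool should_pass;           // true while drift is within the allowance
};

inline flaky_verdict evaluate_flaky(
  int passed, int total, double upstream_reliability, double allowed_drift) {
    assert(total > 0);
    const double current = 100.0 * passed / total;
    const double drift = upstream_reliability - current;
    // Assumption: a drift equal to the allowance still counts as a pass.
    return {current, drift, drift <= allowed_drift};
}
```

For example, 13 passes out of 21 against an upstream reliability of 80.43087971274686 gives a drift of roughly 18.53, which is within the allowance of 50, while 9 passes out of 21 against 93.99293286219081 gives roughly 51.14, which exceeds it. Both agree with the verdicts reported for those rows.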

@andrwng andrwng merged commit 637c0d6 into redpanda-data:dev Sep 25, 2025
17 checks passed