[CORE-8637]: storage: fix race between disk_log_impl::new_segment() and disk_log_impl::close() #24635

Open · wants to merge 1 commit into base: dev
12 changes: 12 additions & 0 deletions src/v/storage/disk_log_impl.cc
@@ -162,6 +162,12 @@ size_t disk_log_impl::compute_max_segment_size() {

ss::future<> disk_log_impl::remove() {
    vassert(!_closed, "Invalid double closing of log - {}", *this);

    // To prevent racing with a new segment being rolled, obtain the mutex here,
    // and indicate it as broken for any future waiters.
    auto roll_lock_holder = co_await _segments_rolling_lock.get_units();
    _segments_rolling_lock.broken();
Comment on lines +168 to +169
Member:
I think this works, but it's a bit unorthodox. Normally we'd expect a pattern where we invoke broken(), and then wait on a gate.

Did you consider setting _closed in close/remove after we wait on the compaction housekeeping gate?
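
For reference, here is a minimal Seastar-style sketch of that orthodox shape: break the lock and flip the closed flag up front, then drain a gate for in-flight work. The struct and member names below are illustrative and are not the actual disk_log_impl members.

```cpp
// Illustrative only: generic Seastar primitives, hypothetical names.
#include <seastar/core/coroutine.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/semaphore.hh>

struct log_like {
    seastar::gate _housekeeping_gate;
    seastar::semaphore _rolling_lock{1};
    bool _closed{false};

    // Background work enters the gate, so close() can wait for it.
    seastar::future<> background_roll() {
        return seastar::with_gate(_housekeeping_gate, [this] {
            return seastar::get_units(_rolling_lock, 1).then([](auto units) {
                // ... roll a new segment while holding the lock ...
            });
        });
    }

    // The "orthodox" shutdown shape: mark closed and break the lock first,
    // then wait on the gate for anything already in flight.
    seastar::future<> close() {
        _closed = true;
        _rolling_lock.broken();
        co_await _housekeeping_gate.close();
    }
};
```

The patch instead acquires _segments_rolling_lock and only then breaks it, which additionally waits out a roll that is already holding the lock.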

@WillemKauf (Contributor, Author) · Jan 6, 2025:
> Did you consider setting _closed in close/remove after we wait on the compaction housekeeping gate?

If the race condition in fact stemmed from maybe_roll_unlocked() instead of apply_segment_ms() (contrary to the example I described above), waiting on _compaction_housekeeping_gate and only setting _closed afterwards wouldn't save us from the race condition, since maybe_roll_unlocked() doesn't consider the _compaction_housekeeping_gate.

> I think this works, but it's a bit unorthodox.

I agree. But it seems that synchronizing on both _segments_rolling_lock and the housekeeping gate could be the most tightly-sealed solution for preventing races here. Open to other suggestions.
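
To make the concern concrete, a hedged sketch (generic Seastar types and made-up names, not the real disk_log_impl) of why a gate-only close() would not cover a roll path that never enters the housekeeping gate, and how taking and then breaking the rolling lock does:

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/sleep.hh>

using namespace std::chrono_literals;

struct race_sketch {
    seastar::gate _housekeeping_gate;
    seastar::semaphore _rolling_lock{1};
    bool _closed{false};

    // Analogous to a roll path that is NOT tracked by the housekeeping gate.
    seastar::future<> untracked_roll() {
        auto units = co_await seastar::get_units(_rolling_lock, 1);
        co_await seastar::sleep(1ms); // scheduling point mid-roll
        // Nothing stops us from reaching the "new_segment()" step here even
        // though _closed may have been set by a gate-only close() meanwhile.
    }

    // Proposed alternative: wait on the gate, then set _closed. It cannot
    // see untracked_roll(), so the interleaving above is still possible.
    seastar::future<> gate_only_close() {
        co_await _housekeeping_gate.close();
        _closed = true;
    }

    // Shape of the patch: take the rolling lock (waits out an in-flight
    // roll), then break it so future rolls fail fast instead of racing.
    seastar::future<> lock_aware_close() {
        auto units = co_await seastar::get_units(_rolling_lock, 1);
        _rolling_lock.broken();
        _closed = true;
    }
};
```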

Member:

> If the race condition in fact stemmed from maybe_roll_unlocked() instead of apply_segment_ms() (contrary to the example I described above), waiting on _compaction_housekeeping_gate and only setting _closed afterwards wouldn't save us from the race condition, since maybe_roll_unlocked() doesn't consider the _compaction_housekeeping_gate.

If this were true, it would mean that close/remove were racing with an append? That seems odd, since we'd expect the raft layer to fully control access to the log and provide mutual exclusion of these operations. Do we know more about the specific scenario that led to the assertion failure?

@WillemKauf (Contributor, Author) · Jan 6, 2025:

> If this were true, it would mean that close/remove were racing with an append? That seems odd, since we'd expect the raft layer to fully control access to the log and provide mutual exclusion of these operations.

To be clear, I was just using maybe_roll_unlocked() as an example of another call site of new_segment() that wouldn't be protected from this race condition (bar higher-level logic) if we went with the proposed solution of setting _closed after waiting on the _compaction_housekeeping_gate (should other code change in the future).

> Do we know more about the specific scenario that led to the assertion failure?

Other than the stacktrace in the JIRA ticket (which doesn't reveal the origin of new_segment()), no.

I'm leaning towards the race I originally described between apply_segment_ms() and close()/remove() being the culprit, in which case your proposed solution would be a fix. I'm not suggesting there is a higher-level bug in Raft.

Member:

Was it triggered during a shutdown? It seems like it could also be a race between close() (shutting down) and remove() (topic delete, partition movement, etc.).


    _closed = true;
    // wait for compaction to finish
    _compaction_as.request_abort();
@@ -209,6 +215,12 @@ disk_log_impl::start(std::optional<truncate_prefix_config> truncate_cfg) {

ss::future<std::optional<ss::sstring>> disk_log_impl::close() {
    vassert(!_closed, "Invalid double closing of log - {}", *this);

    // To prevent racing with a new segment being rolled, obtain the mutex here,
    // and indicate it as broken for any future waiters.
    auto roll_lock_holder = co_await _segments_rolling_lock.get_units();
    _segments_rolling_lock.broken();

    vlog(stlog.debug, "closing log {}", *this);
    _closed = true;
    if (
20 changes: 20 additions & 0 deletions src/v/storage/tests/storage_e2e_fixture_test.cc
@@ -195,3 +195,23 @@ FIXTURE_TEST(test_concurrent_segment_roll_and_close, storage_e2e_fixture) {
std::move(release_holder_fut))
.get();
}

FIXTURE_TEST(test_concurrent_segment_roll_and_ntp_remove, storage_e2e_fixture) {
    const auto topic_name = model::topic("tapioca");
    const auto ntp = model::ntp(model::kafka_namespace, topic_name, 0);

    cluster::topic_properties props;
    add_topic({model::kafka_namespace, topic_name}, 1, props).get();
    wait_for_leader(ntp).get();

    auto& partition_manager = app.partition_manager.local();
    auto partition = partition_manager.get(ntp);
    auto* log = dynamic_cast<storage::disk_log_impl*>(partition->log().get());

    // Race a forced segment roll against removal of the partition's log.
    auto roll_fut = ss::sleep(100ms).then(
      [log] { return force_roll_log(log); });
    auto remove_ntp_fut = partition_manager.remove(
      ntp, cluster::partition_removal_mode::local_only);

    ss::when_all(std::move(roll_fut), std::move(remove_ntp_fut)).get();
}
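
A side note on what callers observe once the lock is broken: with a plain seastar::semaphore, pending and future get_units() calls fail with seastar::broken_semaphore (Redpanda's mutex wrapper may surface a different exception type, but the shape is the same), so a rolling path can treat that as "the log is closing" and bail out. A hypothetical caller-side sketch, not code from this PR:

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/semaphore.hh>

// Hypothetical caller-side handling; names and structure are illustrative.
seastar::future<> try_roll(seastar::semaphore& rolling_lock) {
    try {
        auto units = co_await seastar::get_units(rolling_lock, 1);
        // ... roll a new segment while holding the lock ...
    } catch (const seastar::broken_semaphore&) {
        // The log is being closed or removed; skip the roll.
    }
}
```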