Skip to content

Conversation

@WillemKauf
Copy link
Contributor

This is dangerous because capturing r = reservation_fut.get() and then passing a pointer to it to save_to_cache() will result in a use-after-free should the future be resolved after the consumer lambda is destructed.

Change the lambda signature to only capture values that will be moved or copied into save_to_cache() to ensure lifetime safety.

Relevant Backtrace:

Backtrace:
[Backtrace #0]
seastar::guarded_backtrace(void**, int) at ./external/+non_module_dependencies+seastar/src/util/backtrace.cc:102
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::'lambda'(seastar::frame)>(seastar::backtrace_buffer::append_backtrace()::'lambda'(seastar::frame)&&, bool) at ./external/+non_module_dependencies+seastar/include/seastar/util/backtrace.hh:89
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./external/+non_module_dependencies+seastar/src/core/reactor.cc:801
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./external/+non_module_dependencies+seastar/src/core/reactor.cc:839
seastar::print_with_backtrace(char const*, bool, bool) at ./external/+non_module_dependencies+seastar/src/core/reactor.cc:859
 (inlined by) seastar::sigsegv_action(siginfo_t*, ucontext_t*) at ./external/+non_module_dependencies+seastar/src/core/reactor.cc:4260
 (inlined by) void seastar::install_oneshot_signal_handler<11, (void (*)(siginfo_t*, ucontext_t*))(&seastar::sigsegv_action(siginfo_t*, ucontext_t*))>()::'lambda'(int, siginfo_t*, void*)::operator()(int, siginfo_t*, void*) const at ./external/+non_module_dependencies+seastar/src/core/reactor.cc:4194
 (inlined by) void seastar::install_oneshot_signal_handler<11, (void (*)(siginfo_t*, ucontext_t*))(&seastar::sigsegv_action(siginfo_t*, ucontext_t*))>()::'lambda'(int, siginfo_t*, void*)::__invoke(int, siginfo_t*, void*) at ./external/+non_module_dependencies+seastar/src/core/reactor.cc:4189
/home/willem/redpanda/vbuild/bazel-out/redpanda/bin/../lib/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=cd410b710f0f094c6832edd95931006d883af48e, for GNU/Linux 3.2.0, stripped
?? at ??:0
cloud_io::remote::download_stream(cloud_io::basic_transfer_details<seastar::lowres_clock>, seastar::noncopyable_function<seastar::future<unsigned long> (unsigned long, seastar::input_stream<char>)> const&, std::__1::basic_string_view<char, std::__1::char_traits<char>>, bool, std::__1::optional<std::__1::pair<unsigned long, unsigned long>>, std::__1::function<void (unsigned long)>)::$_0::operator()() const at ./src/v/cloud_io/remote.cc:316
 (inlined by) cloud_io::remote::download_stream(cloud_io::basic_transfer_details<seastar::lowres_clock>, seastar::noncopyable_function<seastar::future<unsigned long> (unsigned long, seastar::input_stream<char>)> const&, std::__1::basic_string_view<char, std::__1::char_traits<char>>, bool, std::__1::optional<std::__1::pair<unsigned long, unsigned long>>, std::__1::function<void (unsigned long)>) at ./src/v/cloud_io/remote.cc:314
cloud_topics::l1::file_io::read_object(cloud_topics::l1::object_extent, seastar::abort_source*) (.resume) at ./src/v/cloud_topics/level_one/common/file_io.cc:221
std::__1::coroutine_handle<seastar::internal::coroutine_traits_base<std::__1::expected<seastar::input_stream<char>, cloud_topics::l1::io::errc>>::promise_type>::resume[abi:ne200100]() const at ./external/toolchains_llvm++llvm+current_llvm_toolchain/bin/../../toolchains_llvm++llvm+current_llvm_toolchain_llvm/bin/../include/c++/v1/__coroutine/coroutine_handle.h:144
 (inlined by) seastar::internal::coroutine_traits_base<std::__1::expected<seastar::input_stream<char>, cloud_topics::l1::io::errc>>::promise_type::run_and_dispose() at ./external/+non_module_dependencies+seastar/include/seastar/core/coroutine.hh:129
seastar::reactor::task_queue::run_tasks() at ./external/+non_module_dependencies+seastar/src/core/reactor.cc:2747
seastar::reactor::task_queue_group::run_tasks() at ./external/+non_module_dependencies+seastar/src/core/reactor.cc:3251
 (inlined by) seastar::reactor::task_queue_group::run_some_tasks() at ./external/+non_module_dependencies+seastar/src/core/reactor.cc:3235
 (inlined by) seastar::reactor::do_run() at ./external/+non_module_dependencies+seastar/src/core/reactor.cc:3418
seastar::reactor::run() at ./external/+non_module_dependencies+seastar/src/core/reactor.cc:3295
seastar::app_template::run_deprecated(int, char**, std::__1::function<void ()>&&) at ./external/+non_module_dependencies+seastar/src/core/app-template.cc:266
seastar::app_template::run(int, char**, std::__1::function<seastar::future<int> ()>&&) at ./external/+non_module_dependencies+seastar/src/core/app-template.cc:160
application::run(int, char**) at ./src/v/redpanda/application.cc:305
main at ./src/v/redpanda/main.cc:22
?? at ??:0
__libc_start_main at ??:0
_start at ??:0

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

This is dangerous because capturing `r = reservation_fut.get()` and then
passing a pointer to it to `save_to_cache()` will result in a use-after-free
should the future be resolved after the `consumer` lambda is destructed.

Change the lambda signature to only capture values that will be moved
or copied into `save_to_cache()` to ensure lifetime safety.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a use-after-free (UAF) bug in file_io::read_object() where a pointer to a stack-allocated reservation object was being captured and passed to an asynchronous callback, potentially causing the pointer to dangle after the lambda's destruction.

Key Changes:

  • Changed save_to_cache() to accept space_reservation_guard by value instead of by pointer
  • Updated lambda capture to move ownership of the reservation and cache_key instead of capturing references

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/v/cloud_topics/level_one/common/file_io.h Changed save_to_cache() signature to accept reservation guard by value
src/v/cloud_topics/level_one/common/file_io.cc Updated implementation to move reservation ownership and eliminate pointer usage

@WillemKauf
Copy link
Contributor Author

WillemKauf commented Jan 7, 2026

Hm. Actually not sure about this fix lining up with the backtrace as posted. 🤔

EDIT1: Stared at this for a bit and I think it's a real fix as described?

EDIT2: Unsure again.

@WillemKauf WillemKauf requested review from Lazin and wdberkeley January 7, 2026 20:01
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#78654
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ConsumerGroupBalancingTest test_coordinator_nodes_balance null integration https://buildkite.com/redpanda/redpanda/builds/78654#019b99c1-e01d-4c26-b788-f66c99c3c4ca FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ConsumerGroupBalancingTest&test_method=test_coordinator_nodes_balance
DataMigrationsApiTest test_higher_level_migration_api null integration https://buildkite.com/redpanda/redpanda/builds/78654#019b99c1-5ecd-486d-8f31-4e8ea7f9443f FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_higher_level_migration_api
WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/78654#019b99c1-e01e-42ec-b46e-c240e435e1c4 FLAKY 15/21 Test PASSES after retries.No significant increase in flaky rate(baseline=0.1148, p0=0.0709, reject_threshold=0.0100. adj_baseline=0.3063, p1=0.3926, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

@WillemKauf
Copy link
Contributor Author

Closing for now, I don't believe this is a fix for the crash observed.

@WillemKauf WillemKauf closed this Jan 7, 2026
Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. it's not clear how the UAF would occur--the code looks like it doesn't have the UAF as described.
  2. passing the reservation by value seems to ignore the original purpose of the reference which was to use the same reservation across multiple invocations of the lambda.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants