CORE-8618 crash_tracker: record uncaught startup exceptions #24854

pgellert · 2025-01-17T15:47:55Z

This PR adds functionality to record crashes caused by an exception during startup as crash reports to disk. This is the first step to implement the core crash recording project which aims to persist information about redpanda broker crashes to disk to help debugging.

The PR implements the core functionality of preparing state to allow writing crash reports in an async-signal-safe way, so that this code can be reused in signal handlers in follow-up PRs.

The crash reports are going to be stored under the data directory. The filenames are constructed as: ${datadir}/crash_reports/${broker_start_time_ms}_${random}.crash.

Reports are serialized using serde and iobuf (with pre-reserved) memory, and the implementation relies on their allocation-free behaviour to ensure that writing crash files in signal handlers is reliable even in OOM scenarios. (This will become important in a follow-up PR when the signal handlers are actually hooked up.)

To write to the crash reports, we need to use the low-level open() function here instead of the seastar API or higher-level C++ primitives because we need to be able to manipulate the file using async-signal-safe, allocation-free functions inside signal handlers. Ref: https://man7.org/linux/man-pages/man7/signal-safety.7.html

Fixes https://redpandadata.atlassian.net/browse/CORE-8618

Backports Required

Release Notes

none

vbotbuildovich · 2025-01-17T19:55:08Z

CI test results

test results on build#60910

test_id	test_kind	job_url	test_status	passed
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli	ducktape	https://buildkite.com/redpanda/redpanda/builds/60910#01947552-5d4b-4150-bcdb-b7b1bb0fd156	FLAKY	1/2

test results on build#60955

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/60955#01948420-a364-4301-adf9-f2249f270189	FLAKY	1/2
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/60955#01948420-a364-4464-a871-097ffe11e6c2	FLAKY	1/2

test results on build#61003

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/61003#01948a54-d3a4-4eaf-b3e8-9ebafd176099	FLAKY	1/2

test results on build#61107

test_id	test_kind	job_url	test_status	passed
rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes	ducktape	https://buildkite.com/redpanda/redpanda/builds/61107#019493c6-079e-4f88-acb1-3b6e6ad619e3	FLAKY	1/2

test results on build#61168

test_id	test_kind	job_url	test_status	passed
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=False.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=False.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=False.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bd-4a0d-9a26-0c6ad6d08dbf	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=False.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ac-40e5-b7df-61bbd218a718	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=None.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=None.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=None.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bd-4a0d-9a26-0c6ad6d08dbf	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=None.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ac-40e5-b7df-61bbd218a718	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=True.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=True.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=True.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bd-4a0d-9a26-0c6ad6d08dbf	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=False.enable_authz=True.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ac-40e5-b7df-61bbd218a718	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=False.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=False.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=False.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bd-4a0d-9a26-0c6ad6d08dbf	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=False.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ac-40e5-b7df-61bbd218a718	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=None.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=None.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=None.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bd-4a0d-9a26-0c6ad6d08dbf	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=None.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ac-40e5-b7df-61bbd218a718	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=True.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=True.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=True.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bd-4a0d-9a26-0c6ad6d08dbf	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=False.use_sasl=True.enable_authz=True.authn_method=mtls_identity.client_auth=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ac-40e5-b7df-61bbd218a718	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=False.enable_authz=False.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=False.enable_authz=False.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=False.enable_authz=None.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=False.enable_authz=None.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=False.enable_authz=True.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=False.enable_authz=True.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=True.enable_authz=False.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=True.enable_authz=False.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=True.enable_authz=None.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=True.enable_authz=None.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=True.enable_authz=True.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bc-4e01-befe-f95532eef1e1	FAIL	0/20
rptest.tests.acls_test.AccessControlListTestUpgrade.test_describe_acls.use_tls=True.use_sasl=True.enable_authz=True.authn_method=mtls_identity.client_auth=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FAIL	0/20
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ad-4677-9090-9a555945bf09	FLAKY	1/2
rptest.tests.configuration_update_test.ConfigurationUpdateTest.test_update_invalid_advertised_api	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4be-4276-8e20-77b97022d62e	FAIL	0/20
rptest.tests.configuration_update_test.ConfigurationUpdateTest.test_update_invalid_advertised_api	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ac-40e5-b7df-61bbd218a718	FAIL	0/20
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ab-481d-a5cf-b3ebee7f7d93	FLAKY	1/3
rptest.tests.strict_data_init_test.StrictDataInitTest.test_strict_data_init_enabled	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#019498fd-b4bd-4a0d-9a26-0c6ad6d08dbf	FAIL	0/20
rptest.tests.strict_data_init_test.StrictDataInitTest.test_strict_data_init_enabled	ducktape	https://buildkite.com/redpanda/redpanda/builds/61168#01949910-f1ad-4677-9090-9a555945bf09	FAIL	0/20

test results on build#61179

test_id	test_kind	job_url	test_status	passed
gtest_raft_rpunit.gtest_raft_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/61179#019499bf-6eb1-4b62-b0ec-88bfa698e99d	FLAKY	1/2
rptest.tests.availability_test.AvailabilityTests.test_availability_when_one_node_failed	ducktape	https://buildkite.com/redpanda/redpanda/builds/61179#01949a1b-10ac-4d9f-9f68-8b1483d62b12	FLAKY	1/2

pgellert · 2025-01-20T13:17:25Z

force-push: rebase to pgellert:crashlog/define-api to update the first 4 commits of this stacked PR with the latest changes

dotnwat

this is looking pretty cool!

src/v/config/node_config.h

src/v/crash_tracker/prepared_writer.cc

dotnwat · 2025-01-20T17:45:01Z

src/v/crash_tracker/prepared_writer.cc

+    // Create the crash recorder file
+    auto f = co_await ss::open_file_dma(
+      _crash_report_file_name.c_str(),
+      ss::open_flags::create | ss::open_flags::rw);


truncate
exclusive

dotnwat · 2025-01-20T17:46:19Z

src/v/crash_tracker/prepared_writer.cc

+      _crash_report_file_name.c_str(),
+      ss::open_flags::create | ss::open_flags::rw);
+    co_await f.close();
+    co_await ss::sync_directory(_crash_report_file_name.parent_path().string());


are you protecting against a certain scenario by syncing the directory here? a comment would probably be helpful since it's likely important?

I've added this comment:

Sync the parent dir to ensure that the new file is observable to the ::open() call below and to later restarts of the process

Let me know if that explains it well enough.

it doesn't hurt anything, but you don't really need to do the sync for these cases--it's only good for if the actual node crashes.

Makes sense. I've removed it now since it's not necessary.

dotnwat · 2025-01-20T17:52:35Z

src/v/crash_tracker/prepared_writer.cc

+    for (const auto& frag : _serde_output) {
+        auto written = ::write(_fd, frag.get(), frag.size());
+        if (written == -1) {
+            return false;


I wouldn't expect us to be able to continue to write to the file descriptor if an earlier ::write() call failed. Am I missing something here? But I also don't see any problems with continuing to try to write (the man page doesn't say that this would be a problem), so I've changed the code to continue to try to write here.

dotnwat · 2025-01-20T17:53:29Z

src/v/crash_tracker/prepared_writer.cc

+        co_await ss::sync_directory(
+          _crash_report_file_name.parent_path().string());


sync is probably unnecessary, but maybe you could provide a little context on what you're trying to protect against?

This is the comment I've added:

Sync the parent dir to ensure that later restarts of the process cannot observe this file

Let me know if it explains it well enough.

visibility is handled by the kernel. process restarts don't need the sync. it doesn't hurt anything, except performance if you were doing many syncs.

src/v/crash_tracker/recorder.cc

dotnwat · 2025-01-20T18:04:26Z

src/v/crash_tracker/recorder.cc

+    std::lock_guard<ss::util::spinlock> g(_writer_lock);
+    co_await _writer.release();


can you comment on the use of a spinlock here compared to a blocking mutex? generally i would think that holding a spinlock for anything other than the tiniest of cpu-bound critical sections (and release here is performing I/O) would not perform as well as a blocking mutex where the kernel can put a blocking thread to sleep and understand the chain of dependencies.

I've added this comment now:

The writer has shared state, so accessing it from multiple threads
is guarded by this lock. We are using the async-signal safe
ss::util::spinlock instead of more efficient locking mechanisms like the
seastar primitives or std::mutex to ensure that we can safely use the
lock in a singal handler context.

This lock will be used in signal handlers in a follow-up PR so it needs to be async-signal safe. The pthread mutex isn't: https://man7.org/linux/man-pages/man3/pthread_mutex_lock.3.html.
The standard seastar concurrency primitives are also not available in the signal handler because the signal handler runs outside of the reactor. ss::util::spinlock is specifically an async-signal safe spinlock that seastar uses in its signal handlers: https://github.com/scylladb/seastar/blob/c3f33c116ecde6f37cea6f656e1dfed2a4cd26a2/include/seastar/util/spinlock.hh#L85-L88

I am not worried about the performance overhead of the spinlock because all of the code paths where the lock is used are outside of performance-critical sections (startup, shutdown, mid-crash), and I don't expect the lock to be contended except for some highly unlikely scenarios (eg. segfault while already recording a failure).

dotnwat · 2025-01-20T18:06:12Z

src/v/crash_tracker/recorder.cc

+    // If the recorder is already locked, give up to prevent possible deadlocks.
+    // For example, an assertion failure while recording a segfault may lead to
+    // a deadlock if we unconditionally waited for the lock here.
+    std::unique_lock<ss::util::spinlock> g(_writer_lock, std::try_to_lock);


same here. why a spinlock?

Answered here: #24854 (comment)

src/v/crash_tracker/prepared_writer.cc

pgellert · 2025-01-21T16:11:13Z

force-push: address code review feedback

Defines it under `${datadir}/crash_reports` to store crash reports. It needs to be special-cased in the `compute_storage.py` script to ensure it isn't mistaken for a log file directory.

Define a custom exception for `crash_loop_limit_reached` to allow pattern matching for it in a follow up commit. We specifically do not want to record `crash_loop_limit_reached` errors as crashes because they are generally not informative crashes. Recording them as real crashes would build up garbage on disk, which would lead to real crash logs expiring on disk earlier.

pgellert · 2025-01-21T18:12:45Z

force-push: rebased to dev now that the base PR (#24828) merged

michael-redpanda

lgtm, just one remaining question

src/v/crash_tracker/recorder.cc

BenPope · 2025-01-22T14:33:35Z

src/v/crash_tracker/prepared_writer.cc

+
+void prepared_writer::write() {
+    vassert(_state == state::filled, "Unexpected state: {}", _state);
+    _state = state::written;


Is this transition correct if try_write_crash() fails? Should failure be represented in the state machine?

Is this transition correct if try_write_crash() fails?

Yes, it's correct though I get that the naming is a bit misleading. Maybe consumed would be a better name to represent this state.

Should failure be represented in the state machine?

We could add a separate failure state in the state machine but at the moment it is identical to the written state so I think it's simpler as is.

BenPope · 2025-01-22T15:04:32Z

src/v/crash_tracker/recorder.cc

+          cd.stacktrace.begin() + pos,
+          cd.stacktrace.size() - pos,


nitpick: You could use std::next/prev if there's a lint complaint about pointer arithmetic here.

Nice, thank you! Let me carry this onto a follow-up PR unless any more changes come up for this PR.

I kept it as cd.stacktrace.begin() + pos because std::next(cd.stacktrace.begin(), static_cast<std::ptrdiff_t>(pos)) seems overly verbose, and it's not failing the automated linting checks.

dotnwat · 2025-01-22T19:01:35Z

src/v/crash_tracker/prepared_writer.cc

+        while (written < frag.size()) {
+            auto res = ::write(
+              _fd, frag.get() + written, frag.size() - written);
+            if (res == -1) {


did you consider retrying on EINTR? it's niche, but still possible. and you did state that you were avoiding seastar I/O because you cared about signal handling...

Maybe also EAGAIN, but I don't think this should happen here

did you consider retrying on EINTR?

Yeah, it makes sense to retry on EINTR. I've added this now.

and you did state that you were avoiding seastar I/O because you cared about signal handling...

I care about signal handler and seastar I/O for a different reason. It's that it is not safe to perform seastar I/O in a signal handler. Nevertheless, handling signals interrupting I/O (EINTR) is a net reliability improvement, so it makes sense to add it.

Maybe also EAGAIN, but I don't think this should happen here

The man page tells me EAGAIN is only for (non-blocking) sockets, so I don't think we need to handle that here. (https://man7.org/linux/man-pages/man2/write.2.html)

dotnwat · 2025-01-22T19:04:44Z

src/v/crash_tracker/prepared_writer.cc

+    auto f = co_await ss::open_file_dma(
+      _crash_report_file_name.c_str(),
+      ss::open_flags::create | ss::open_flags::rw | ss::open_flags::truncate
+        | ss::open_flags::exclusive);


my original comment about using exclusive here was actually a note to myself to revisit my comment after finishing the review, but i forgot. the problem with exclusive is that this open will fail if the file already exists on disk. is that going to be an issue, or is exclusive behavior desired? my hunch is that you may not want it, but the note to myself was just to ask if you wanted it.

is that going to be an issue, or is exclusive behavior desired?

Exclusive is beneficial but not required here. We always want to create a not-yet-existing file here to ensure that we are not overwriting previous crash reports. The commit crash_tracker: prepare and release crash log file ensures that the file we open doesn't exist yet. Using exclusive here asserts that the file does not exist yet, which I believe is useful, at least as a sanity check.

dotnwat · 2025-01-22T19:07:07Z

src/v/crash_tracker/prepared_writer.cc

+        co_await ss::sync_directory(
+          _crash_report_file_name.parent_path().string());


visibility is handled by the kernel. process restarts don't need the sync. it doesn't hurt anything, except performance if you were doing many syncs.

dotnwat · 2025-01-22T19:07:51Z

@pgellert the note about exclusive flag is probably the importnat one

dotnwat · 2025-01-22T19:41:00Z

src/v/crash_tracker/recorder.cc

+}
+
+ss::future<> recorder::stop() {
+    std::lock_guard<ss::util::spinlock> g(_writer_lock);


commented here #24883 (review) not sure exactly the relationship between these PRs in terms of where reviews should happen.

Thanks, yeah let's discuss it on this PR. #24883 stacks on this one and only the last 4 commits are new there. Github's not great with stacked PRs so I just usually just add a bolded note about it to the cover letter but it's still easy to miss.

can you comment on what you mean by async-signal safe here and what concurrency you are trying to limit, exactly?

@michael-redpanda had a similar question I answered above: #24854 (comment)
async-signal safety: https://man7.org/linux/man-pages/man7/signal-safety.7.html

if the you're holding this lock and the current thread handles a signal that tries to acquire the same lock, then it's going to be a deadlock, no?

Not with try_lock. I am planning to use try_lock to acquire the lock in the signal handler to make sure it cannot deadlock. You can see the target final implementation in my POC PR's recorder.cc file here: https://github.com/redpanda-data/redpanda/pull/24493/files#diff-f90b64a3764bede06ac5ec9f14bd838d8d3147eefcc70cb6b8b7817b3a74ed95

maybe we can design this so that we don't need any locking? holding a spinlock while doing I/O and co_awaiting futures doesn't feel like a great idea.

I chose the spinlock because it is a primitive provided by seastar, so I prefer to use that instead of using raw std::atomic. But I can be convinced otherwise. An alternative I can think of is making prepared_writer thread-safe by using CAS operations on the _state held in prepared_writer.

Let me know what you think. If given this context you think that a spinlock is reasonable I would add change the comment around it to the following:

/// The writer has shared state, so accessing it from multiple threads /// is guarded by this lock. We use the async-signal-safe /// `ss::util::spinlock` instead of more efficient locking mechanisms (e.g., /// Seastar primitives or `std::mutex`) because this lock needs to be safe /// to acquire in a signal handler context. For reference, see the /// signal-safety guidelines: /// https://man7.org/linux/man-pages/man7/signal-safety.7.html. /// /// While this lock ensures mutual exclusion for access to `_writer`, it is /// also used to ensure the visibility of changes to `_writer`'s state /// across threads. For example, if `start()` is called on thread T0, /// `record_crash_sighandler()` on thread T3, and `stop()` again on T0, the /// lock guarantees visibility and prevents data races in the state of /// `_writer`. /// /// Note: care must be taken to avoid deadlock. In signal handlers, only use /// try_lock to try to ensure the signal handler is not blocked forever.

if all you are trying to do is safely set a flag so that a shutdown signal handler doesn't conflict with other executions, you could use the spinlock around access to the flag only. holding the spinlock around co_await sounds like a recipe for problems?

also, is there any information on why we are writing all the signal handling ourselves? seastar has ss::engine::handle_signal too, but presumably that isn't good enough?

if all you are trying to do is safely set a flag so that a shutdown signal handler doesn't conflict with other executions, you could use the spinlock around access to the flag only. holding the spinlock around co_await sounds like a recipe for problems?

I believe the spinlock is also protecting the buffers held by prepared_writer to ensure multiple threads don't attempt to write to the buffers at the same time.

also, is there any information on why we are writing all the signal handling ourselves? seastar has ss::engine::handle_signal too, but presumably that isn't good enough?

We can dig into this, but reading through the code I think it replaces the current action with whatever function you provide it. So if we used it to register say SIGSEGV, I believe it replaces Seastar's SEGV handler. Very much could be wrong but we'll look into it. Also I believe we want to ensure a certain ordering (either our handlers first or Seastar's handlers first, can't exactly remember).

is there any information on why we are writing all the signal handling ourselves? seastar has ss::engine::handle_signal too, but presumably that isn't good enough?

The main reason is that Seastar's ss::engine::handle_signal is designed for efficiently handling user signals (e.g., SIGUSR1, SIGUSR2, SIGINT, SIGTERM, SIGHUP) but it is unsuitable for crash signals (like SIGABRT during an OOM scenario). Signal handlers configured through handle_signal are scheduled to run asynchronously on the reactor (the relevant code is here: https://github.com/scylladb/seastar/blob/1822136684ed0942e49e5299e86393534d465633/src/core/reactor.cc#L730C24-L749). However, for crash scenarios we need a synchronous signal handler to execute minimal, safe actions (e.g., writing out crash data) before termination, without relying on the reactor to pick up the signal later, as that would be unreliable.

Regarding signal handler ordering: Seastar's handler terminates the process as its last step, so redpanda's signal handler must execute first. I intend to wrap Seastar's handler to ensure it still runs after ours, although we could technically just replace seastar's signal handler wholesale with ours. (We can also discuss this further in the upcoming signal handler PR.)

if all you are trying to do is safely set a flag so that a shutdown signal handler doesn't conflict with other executions, you could use the spinlock around access to the flag only. holding the spinlock around co_await sounds like a recipe for problems?

We won't actually block on this spinlock ever. I've reworked the code now to assert that we never block on the spinlock instead of just knowing that this is the case.

Mainly I'm using the lock to provide visibility guarantees. With the caveat that I do use it to prevent sighandlers from concurrently accessing the writer while start/stop/record_crash_exception/another call of record_crash_sighandler is running.

We could replace the spinlock with raw atomics. It would look something like this: e82f211. But I personally prefer the lock version because it is easier to work with (lock_guard > ss::defer, no need to select memory ordering).

As discussed offline, I've reworked this now to use two std::atomic<bool>s for concurrency control instead of the spinlock.

pgellert · 2025-01-23T14:06:31Z

force-push: address code review (remove directory sync, avoid blocking on spinlock, handle EINTR) + remove some whitespace

pgellert · 2025-01-24T14:26:22Z

force-push: reworked the concurrency control around prepared_writer to use two std::atomic<bool>'s instead of a ss::util::spinlock. This is to avoid holding locks around co_await scheduling points and to better represent the visibility and mutual exclusion requirements of prepared_writer.

vbotbuildovich · 2025-01-24T17:45:40Z

Retry command for Build#61168

please wait until all jobs are finished before running the slash command



/ci-repeat 1
tests/rptest/tests/strict_data_init_test.py::StrictDataInitTest.test_strict_data_init_enabled
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":true,"enable_authz":false,"use_sasl":false,"use_tls":false}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":true,"enable_authz":null,"use_sasl":false,"use_tls":false}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":true,"enable_authz":true,"use_sasl":false,"use_tls":false}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":true,"enable_authz":false,"use_sasl":true,"use_tls":false}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":true,"enable_authz":null,"use_sasl":true,"use_tls":false}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":true,"enable_authz":true,"use_sasl":true,"use_tls":false}
tests/rptest/tests/configuration_update_test.py::ConfigurationUpdateTest.test_update_invalid_advertised_api
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":false,"use_sasl":false,"use_tls":false}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":null,"use_sasl":false,"use_tls":false}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":true,"use_sasl":false,"use_tls":false}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":false,"use_sasl":true,"use_tls":false}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":null,"use_sasl":true,"use_tls":false}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":true,"use_sasl":true,"use_tls":false}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":false,"use_sasl":false,"use_tls":true}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":null,"use_sasl":false,"use_tls":true}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":true,"use_sasl":false,"use_tls":true}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":false,"use_sasl":true,"use_tls":true}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":null,"use_sasl":true,"use_tls":true}
tests/rptest/tests/acls_test.py::AccessControlListTestUpgrade.test_describe_acls@{"authn_method":"mtls_identity","client_auth":false,"enable_authz":true,"use_sasl":true,"use_tls":true}

pgellert · 2025-01-24T19:21:07Z

force-push: address @michael-redpanda's review comments left on #24883 (#24883 (comment), #24883 (comment), #24883 (comment))

reduce duplication by introducing a std::atomic<state> instead of maintaining both the state and two std::atomic<bool>s
Fix the test failures. These were due to improper handling of startup exceptions thrown before the crash tracker was initialized. Eg. throwing in check_environment() cannot be recorded because it is before init_crashtracker() sets up the crash recording state. This bug was caught by the vasserts in the state machine of prepared_writer and was the cause of the ducktape test failures above.

redpanda/src/v/redpanda/application.cc

Lines 501 to 502 in 6651515

check_environment();

init_crashtracker(app_signal);

src/v/crash_tracker/prepared_writer.cc

Introduces the core stateful writer which is used to write out `crash_description` serde objects to crash report files. It is able to write out the crash reports in an async-signal-safe way by: * using only async signal safe syscalls * pre-allocating all the memory necessary to construct, serialize and write our `crash_description` objects to disk * not allocating in fill() and write(), i.e. on the signal handler path Ref: https://man7.org/linux/man-pages/man7/signal-safety.7.html The `prepared_writer` is hooked up to the `recorder` in a later commit of this PR. Note that fill() may race with other calls of fill() or with release(), and the class safely breaks this race. These races can occur if multiple signals arrive on different threads at the same time, or if a signal arrives concurrently with release() being called on shard 0.

Implement the logic to choose a unique crash log file name, and hook up the code for preparing the crash recording code and the crash file cleanup code on clean shutdown.

Implement the recording of startup exceptions into crash files.

Add some simple ducktape tests to assert that crash reports are being generater (or not generated) as expected. The contents of the crash files are going to be validated as a follow up once the contents are actually printed to logs.

pgellert · 2025-01-27T11:04:59Z

force-push: improve the code comments, otherwise it is a noop

.

pgellert requested a review from a team January 17, 2025 15:47

pgellert self-assigned this Jan 17, 2025

pgellert requested review from michael-redpanda and a team and removed request for a team and michael-redpanda January 17, 2025 15:47

pgellert requested a review from a team as a code owner January 17, 2025 15:47

pgellert requested review from BenPope, michael-redpanda and IoannisRP and removed request for a team January 17, 2025 15:47

github-actions bot added area/build area/redpanda labels Jan 17, 2025

pgellert marked this pull request as draft January 17, 2025 15:48

pgellert marked this pull request as ready for review January 17, 2025 15:48

pgellert force-pushed the crashlog/record-startup-exc branch from 1dc3a86 to 09ccff8 Compare January 20, 2025 13:15

dotnwat reviewed Jan 20, 2025

View reviewed changes

michael-redpanda reviewed Jan 21, 2025

View reviewed changes

pgellert force-pushed the crashlog/record-startup-exc branch from 09ccff8 to c7e2ba2 Compare January 21, 2025 16:10

pgellert requested review from michael-redpanda and dotnwat January 21, 2025 16:11

pgellert added 2 commits January 21, 2025 18:10

config: define crash_reports directory

1fae3d0

Defines it under `${datadir}/crash_reports` to store crash reports. It needs to be special-cased in the `compute_storage.py` script to ensure it isn't mistaken for a log file directory.

pgellert force-pushed the crashlog/record-startup-exc branch from c7e2ba2 to cfae5ae Compare January 21, 2025 18:11

pgellert mentioned this pull request Jan 22, 2025

CORE-8619 crash_tracker: print recorded crash reports #24883

Merged

7 tasks

michael-redpanda previously approved these changes Jan 22, 2025

View reviewed changes

src/v/crash_tracker/recorder.cc Outdated Show resolved Hide resolved

BenPope reviewed Jan 22, 2025

View reviewed changes

dotnwat reviewed Jan 22, 2025

View reviewed changes

dotnwat previously requested changes Jan 22, 2025

View reviewed changes

pgellert dismissed michael-redpanda’s stale review via d2779c7 January 23, 2025 14:05

pgellert force-pushed the crashlog/record-startup-exc branch from cfae5ae to d2779c7 Compare January 23, 2025 14:05

pgellert requested review from michael-redpanda, BenPope and dotnwat January 23, 2025 14:20

pgellert force-pushed the crashlog/record-startup-exc branch from d2779c7 to 07386cb Compare January 24, 2025 14:23

pgellert force-pushed the crashlog/record-startup-exc branch from 07386cb to ed68cc6 Compare January 24, 2025 19:14

michael-redpanda reviewed Jan 24, 2025

View reviewed changes

src/v/crash_tracker/prepared_writer.cc Outdated Show resolved Hide resolved

pgellert added 4 commits January 27, 2025 10:49

crash_tracker: prepare and release crash log file

046857b

Implement the logic to choose a unique crash log file name, and hook up the code for preparing the crash recording code and the crash file cleanup code on clean shutdown.

crash_tracker: record startup exception crashes

0b1f4f1

Implement the recording of startup exceptions into crash files.

dt: test crash report file creation and cleanup

5d3e146

Add some simple ducktape tests to assert that crash reports are being generater (or not generated) as expected. The contents of the crash files are going to be validated as a follow up once the contents are actually printed to logs.

pgellert force-pushed the crashlog/record-startup-exc branch from ed68cc6 to 5d3e146 Compare January 27, 2025 11:04

pgellert requested a review from michael-redpanda January 27, 2025 11:10

michael-redpanda approved these changes Jan 27, 2025

View reviewed changes

pgellert merged commit da553ee into redpanda-data:dev Jan 27, 2025
18 checks passed

		co_await ss::sync_directory(
		_crash_report_file_name.parent_path().string());

		std::lock_guard<ss::util::spinlock> g(_writer_lock);
		co_await _writer.release();

CORE-8618 crash_tracker: record uncaught startup exceptions #24854

CORE-8618 crash_tracker: record uncaught startup exceptions #24854

Conversation

pgellert commented Jan 17, 2025 • edited Loading

Backports Required

Release Notes

vbotbuildovich commented Jan 17, 2025 • edited Loading

CI test results

pgellert commented Jan 20, 2025

dotnwat left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dotnwat Jan 22, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pgellert Jan 21, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pgellert commented Jan 21, 2025

pgellert commented Jan 21, 2025

michael-redpanda left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dotnwat commented Jan 22, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dotnwat Jan 22, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pgellert commented Jan 23, 2025

pgellert commented Jan 24, 2025

vbotbuildovich commented Jan 24, 2025 • edited Loading

Retry command for Build#61168

pgellert commented Jan 24, 2025

pgellert commented Jan 27, 2025

pgellert commented Jan 17, 2025 •

edited

Loading

vbotbuildovich commented Jan 17, 2025 •

edited

Loading

dotnwat Jan 22, 2025 •

edited

Loading

pgellert Jan 21, 2025 •

edited

Loading

dotnwat Jan 22, 2025 •

edited

Loading

vbotbuildovich commented Jan 24, 2025 •

edited

Loading