Skip to content

CORE-7718 metadata #23508

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

michael-redpanda
Copy link
Contributor

@michael-redpanda michael-redpanda commented Sep 26, 2024

  • permit setting the directory to use to output the debug bundle
  • Adds a metadata object to the debug bundle service
  • Upon completion of an debug bundle run (regardless of success), will enqueue that into the kvstore
  • When service restarts, will check kvstore to see if metadata exists and if it does will 'reload' the state

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • None

@michael-redpanda michael-redpanda requested a review from a team September 26, 2024 12:49
@michael-redpanda michael-redpanda self-assigned this Sep 26, 2024
@michael-redpanda michael-redpanda requested a review from a team as a code owner September 26, 2024 12:49
@michael-redpanda michael-redpanda requested review from oleiman and removed request for a team September 26, 2024 12:49
@michael-redpanda michael-redpanda changed the title Core 7718 metadata CORE-7718 metadata Sep 26, 2024
@michael-redpanda
Copy link
Contributor Author

Force push fee45fc:

  • Cleaned up a couple commits

@michael-redpanda
Copy link
Contributor Author

Force push 7ea5560:

  • Rebased off of dev

Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not finished my review, but it's moving underneath me.

@@ -250,7 +292,7 @@ ss::future<result<void>> service::initiate_rpk_debug_bundle_collection(
job_id,
co_await external_process::external_process::create_external_process(
std::move(args)),
form_debug_bundle_file_path(_debug_bundle_dir, job_id));
form_debug_bundle_file_path(output_dir, job_id));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I see this gets changed to debug_bundle_file_path later, but this seems like an opportune moment.

@BenPope BenPope self-requested a review September 26, 2024 18:02
@michael-redpanda
Copy link
Contributor Author

Force push 4cda960:

  • Fixes from PR comments
  • Addressed cmake build issue
  • Addressed bazel build issue

@michael-redpanda
Copy link
Contributor Author

Force push 21db2f8:

  • Fixed clang format issue

BenPope
BenPope previously approved these changes Sep 27, 2024
Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a commit that has a cherry-pick message with a sha that won't be useful.

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good but want to resolve the kvstore question before merging.

@@ -35,6 +35,7 @@ concept Vector = requires(T t) {
t.begin();
t.end();
{ t.size() } -> std::convertible_to<std::size_t>;
t.shrink_to_fit();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the commit message doesn't seem to indicate that this was a problem prior to the commit, so i'm wondering: is this related to the previous commit that added shrink_to_fit to the bytes type? if so, we may have an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's related. I'm storing the sha256 checksum as bytes into the kvstore. I can switch to use fragmented_vector if that's an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's related. I'm storing the sha256 checksum as bytes into the kvstore. I can switch to use fragmented_vector if that's an issue.

but bytes should not be being handled by the vector serde encoder. i think the issue is that the serde encoder is working with "things that look like vectors". i think can drop the shrink-to-fit and the rw/vector.h and include rw/bytes.h?

separately ill take a todo item to make this stricter, or maybe they should be combined if their on-wire formats are identical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include rw/bytes.h?

oh my bad, yeah good call

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will say that the change in rw/vector.h I think is still good (though probably no longer relevant to this PR), since the tag_invoke(tag_t<read_tag>) function calls shrink_to_fit()

ssx::spawn_with_gate(_gate, [this, job_id]() {
return _rpk_process->wait()
.then([this, job_id](auto) { return handle_wait_result(job_id); })
.handle_exception_type([](const std::exception& e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean handle_exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeh

Comment on lines 645 to 724
_rpk_process->cout().copy(),
_rpk_process->cerr().copy(),
job_id,
debug_bundle_file,
std::move(sha256_checksum),
_rpk_process->get_wait_result());

iobuf buf;
serde::write(buf, std::move(md));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kvstore isn't really designed to handle large data. so storing stdout/stderr is a bit concerning here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how big is too big? I want to say this is a few hundred bytes in the common case, but I'm not too sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if go crashed and produced a large backtrace? or whatever there is a bug in the collector and it dumped a bunch of stuff out.

how big is too big?

yeh good question--it is stored in ram for the lifetime of the key, and there is no option not to store it in ram.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could store it on disk as a file with the same key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I was planning on running rpk debug bundle with the --verbose flag which would probably increase the amount of data in stdout/stderr, which I guess could produce a bit of information.

Why keep around stdout/stderr?

In case a previous run of rpk debug bundle failed and the broker reset. Maybe we'd like to know why it failed. Or on success, if there was anything interesting in the verbose output that may help with debugging a problem.

Why kvstore?

Originally I was going to output a json metadata file with all this information in it, but as reminded about the existence of the kvstore, so switched to doing that.

So to Oren's question, how big is too big?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to Oren's question, how big is too big?

we don't have any formal guidelines, but the intention of the kvstore in its current state is to store small metadata.

seems you could keep the metadata in the kvstore, and the stdout/stderr on disk with the same key name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to move it out of the kvstore and onto disk. If doing so I wonder if it should be JSON or it's fine to remain in a serde format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems you could keep the metadata in the kvstore, and the stdout/stderr on disk with the same key name?

Yeah that's probably fine something like <uuid>.stdout and <uuid>.stderr.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If doing so I wonder if it should be JSON or it's fine to remain in a serde format.

binary is probably fine. its presumably ascii or utf8?

oleiman
oleiman previously approved these changes Sep 27, 2024
Copy link
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, few random comments but nothing pressing

@michael-redpanda michael-redpanda dismissed stale reviews from oleiman and BenPope via 3b0aaa6 September 29, 2024 23:31
@michael-redpanda
Copy link
Contributor Author

Force push 44857e2:

  • Fixed configuration description per docs suggestion

@michael-redpanda
Copy link
Contributor Author

Force push 53d077b:

  • Rebased off of dev to fix merge conflict

@michael-redpanda michael-redpanda requested a review from a team as a code owner October 1, 2024 12:09
@michael-redpanda michael-redpanda requested review from jackietung-redpanda and removed request for a team October 1, 2024 12:09
@michael-redpanda
Copy link
Contributor Author

Force push 1464c52:

  • Fixed issue during rebase

@michael-redpanda michael-redpanda force-pushed the CORE-7718-metadata branch 2 times, most recently from d129734 to ad7106c Compare October 1, 2024 13:59
Removed friendship between service and process and added accessors and
helper methods.

Signed-off-by: Michael Boquard <[email protected]>
Signed-off-by: Michael Boquard <[email protected]>
(cherry picked from commit c483162)
Signed-off-by: Michael Boquard <[email protected]>
(cherry picked from commit 55b414b)
These structures are used to store metadata and process output to the
kvstore and disk in order to maintain the state of the service between
application restarts.

Signed-off-by: Michael Boquard <[email protected]>
Now when a new process has been kicked off, the previous run will be
cleaned up.  Added tests to verify new functionality.

Signed-off-by: Michael Boquard <[email protected]>
Generating and storing metadata into kvstore and added tests to validate
it.

Signed-off-by: Michael Boquard <[email protected]>
Added functionality that will reload metadata from the kvstore after the
service restarts.

Signed-off-by: Michael Boquard <[email protected]>
I didn't touch it so no idea how this snuck in.

Signed-off-by: Michael Boquard <[email protected]>
@michael-redpanda
Copy link
Contributor Author

Force push d20e80c:

  • Fixed conflict with storage/BUILD

Copy link
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. largely taking your word for it on the changes to unit tests

ssx::spawn_with_gate(_gate, [this, job_id]() {
return _rpk_process->wait()
.then([this, job_id](auto) {
auto hold = _gate.hold();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally sure I get why you need to hold the gate here. is it because the background task completes when you return the _process_control_mutex.get... future below?

I don't write enough continuation style code 😅

@michael-redpanda michael-redpanda merged commit 4327ada into redpanda-data:dev Oct 1, 2024
18 checks passed
michael-redpanda added a commit to michael-redpanda/redpanda that referenced this pull request Oct 2, 2024
Thanks to some merge ordering schenangians, these tests started failing
when redpanda-data#23557 merged after redpanda-data#23508.  This change addresses the test bug by
properly obtaining the configuration property and handling a situation
when the debug bundle directory configuration is empty.

Signed-off-by: Michael Boquard <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants