feat(cli): Handle reload based on referenced file change #22539

Merged: 18 commits into vectordotdev:master, Mar 17, 2025

Conversation

@gllb (Contributor) commented Feb 28, 2025

Summary

The TLS crt_file and key_file from http sinks are now part of the watcher list, so a change to either file triggers a Vector reload.

Ref #17283
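A conceptual sketch of the idea, using hypothetical names rather than Vector's actual API: the -w watcher is fed not only the config files themselves, but also any files the config references.

```rust
use std::path::PathBuf;

// Conceptual sketch only; `paths_to_watch` and its signature are
// hypothetical, not Vector's real watcher API.
fn paths_to_watch(config_paths: &[PathBuf], referenced_files: &[PathBuf]) -> Vec<PathBuf> {
    config_paths
        .iter()
        .chain(referenced_files) // e.g. crt_file and key_file of an http sink
        .cloned()
        .collect()
}
```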

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Adding this sink to config/vector.yaml:

```yaml
http_test:
  type: "http"
  inputs: ["parse_logs"]
  uri: "http://test"
  encoding:
    codec: "json"
    json:
      pretty: true
  tls:
    crt_file: /etc/ssl/tls-cert/tls.crt
    key_file: /etc/ssl/tls-cert/tls.key
```

Then execute:

```sh
vector -vvv -c config/vector.yaml -w
```

In parallel, run:

```sh
cd /etc/ssl/tls-cert/ && openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 3650 -nodes -subj "/C=XX/ST=StateName/L=CityName/O=CompanyName/OU=CompanySectionName/CN=CommonNameOrHostname"
```
The Vector logs then confirm the reload (screenshot omitted).

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

  • Please read our Vector contributor resources.
    • make check-all is a good command to run locally. This check is
      defined here. Some of these
      checks might not be relevant to your PR. For Rust changes, at the very least you should run:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run dd-rust-license-tool write to regenerate the license inventory and commit the changes (if any). More details here.

References

Ref: #10877

@gllb gllb requested a review from a team as a code owner February 28, 2025 11:12
@github-actions bot added the domain: topology (Anything related to Vector's topology code) and domain: sinks (Anything related to the Vector's sinks) labels Feb 28, 2025
@pront (Member) left a comment

Left a few nits but overall this looks great. Thank you @gllb

@gllb (Contributor, Author) commented Mar 6, 2025

I'm not able to run cargo nextest run --workspace as it consumes too much memory and gets killed. But all clippy issues are fixed.

@gllb (Contributor, Author) commented Mar 10, 2025

@pront, is there anything I need to do to move this forward?

```diff
@@ -543,6 +547,11 @@ impl RunningTopology {
                .is_some()
            }))
            .collect::<Vec<_>>();

        if let Some(mut components) = components_to_reload {
            sinks_to_change.append(&mut components)
```
@pront (Member) commented:

I have a few more thoughts on this implementation. What you have now is good and works for the HTTP sink. However, we can go further:

  1. We should support reloading any component (sources, transforms, sinks).
  2. Looking at files_to_watch, it is specific to the http sink. Many components support the same TLS settings, so it should be easy to write a reusable util that exposes these paths for all components that use them. (We can start by duplicating code if needed and think about DRY later.) A sketch of the idea follows.
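A minimal sketch of such a reusable util, with simplified stand-in types; TlsConfig here is illustrative, not Vector's actual type:

```rust
use std::path::PathBuf;

// Simplified stand-in for the TLS settings shared by many components;
// field names mirror the config keys from the example above.
pub struct TlsConfig {
    pub crt_file: Option<PathBuf>,
    pub key_file: Option<PathBuf>,
    pub ca_file: Option<PathBuf>,
}

impl TlsConfig {
    /// Collect every file path referenced by these TLS settings so the
    /// config watcher can subscribe to changes on them.
    pub fn files_to_watch(&self) -> Vec<PathBuf> {
        [&self.crt_file, &self.key_file, &self.ca_file]
            .into_iter()
            .flatten()
            .cloned()
            .collect()
    }
}

/// Any component (source, transform, or sink) could opt in by exposing
/// the runtime file paths it depends on; the default is no paths.
pub trait WatchedPaths {
    fn files_to_watch(&self) -> Vec<PathBuf> {
        Vec::new()
    }
}
```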

@gllb (Contributor, Author) replied:

So you want this PR to also implement this for all components? Or just to make sure the code can be extended in a future change, i.e. that shutdown_diff can handle source and transform changes as well?

@pront (Member) replied:

Hi @gllb, I am looking at this ticket: #10877.

So we have a few choices here:

  1. accept the PR as is and follow up with multiple PRs (it doesn't have to be you, btw)
  2. enhance this PR to handle all components (sources, transforms, sinks)
  3. choice 2 plus handling all components that expose TLS settings

Perhaps choice (2) is good here.

@gllb (Contributor, Author) replied Mar 12, 2025:

OK for choice 2. That means, in fact: integrate with ConfigDiff instead of patching shutdown_diff, as stated in your other comment, right?

@pront (Member) replied:

Yes. In order to simplify further reviews, I went ahead and enabled merging for this PR. I would appreciate it if you follow up with a PR to handle all components and a smarter config diff, but it's understandable if you don't have time for that. I think this PR is a big step in the right direction.

```diff
@@ -336,9 +336,32 @@ async fn handle_signal(
    allow_empty_config: bool,
) -> Option<SignalTo> {
    match signal {
        Ok(SignalTo::ReloadComponents(component_keys)) => {
```
@pront (Member) commented Mar 12, 2025:

Some implementation details:

  1. We can add component_keys (or whatever struct we end up with) as a new RunningTopology field to avoid passing them around to all these functions.
  2. Then, we can use this new field in async fn shutdown_diff(...).
  3. We can probably add the notion of changed watched paths to src/config/diff.rs (see the sketch below).
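A rough sketch of idea (3), with simplified stand-in types; the real ConfigDiff in src/config/diff.rs differs in detail:

```rust
use std::collections::HashSet;

// Simplified stand-ins; Vector's ComponentKey and ConfigDiff are richer.
type ComponentKey = String;

#[derive(Default)]
struct Difference {
    to_change: HashSet<ComponentKey>,
}

#[derive(Default)]
struct ConfigDiff {
    sinks: Difference,
}

impl ConfigDiff {
    /// Mark components whose referenced files (e.g. TLS certs) changed on
    /// disk as "to change", even though their config text is identical.
    fn flag_watched_path_changes(&mut self, keys: &[ComponentKey]) {
        for key in keys {
            // This sketch assumes every key is a sink; a real version
            // would need to know each key's component type.
            self.sinks.to_change.insert(key.clone());
        }
    }
}
```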

@gllb (Contributor, Author) replied:

The component_keys here are passed when a change is detected by the watcher. If we make it a field of RunningTopology, it is supposed to be empty almost all the time, except when a change is detected, which will still imply passing it to reload_config_and_respawn anyway, right? Or maybe I don't get it?

For the notion of watched-path changes in ConfigDiff, I think I see what you mean.

@gllb (Contributor, Author) commented Mar 12, 2025:

@pront, so I did implement this logic in ConfigDiff, and it does what it should. Only this part, https://github.com/gllb/vector/blob/master/src/topology/running.rs#L490-L509, causes the sink to be reused and therefore not reloaded.
In the implementation where everything is done inside shutdown_diff, it's easy to work around: I can append to sinks_to_change afterwards, so the buffer-reuse code doesn't even know about it. But since it's in ConfigDiff now, I don't see how to work around it. Any idea?

@gllb (Contributor, Author) replied:

I could remove that part, but I don't know whether that's safe to do.

@pront (Member) replied:

Glad we agree on the src/config/diff.rs enhancement.

What is more important is how to keep track of the component type. I don't think we currently have an enum or similar to model this, but we probably need to associate a ComponentKey with a new ComponentType. Thinking about it, I realize this might be better done in a follow-up PR, since it might need some discussion.

> If we use it as a field of RunningTopology then it supposed to be empty almost all the time, except when a change is detected.

Correct. It will be empty until the watcher detects a file change, which pushes keys into it; processing a reload then clears the whole thing. I am a bit ambivalent on this, so feel free to TIOLI (take it or leave it).

> Which will still imply passing it in reload_config_and_respawn anyway, right?

To be clear, I was proposing adding it here:

pub struct RunningTopology {

reload_config_and_respawn will then have access to it via self.
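A minimal sketch of that proposal, with a hypothetical field and method names:

```rust
use std::collections::HashSet;

type ComponentKey = String; // stand-in for Vector's real ComponentKey

pub struct RunningTopology {
    /// Empty almost all of the time: the watcher pushes keys in here
    /// when a referenced file changes, and the next reload drains it.
    components_to_reload: HashSet<ComponentKey>,
}

impl RunningTopology {
    /// Called when the watcher detects a change to a file that `key` references.
    pub fn flag_for_reload(&mut self, key: ComponentKey) {
        self.components_to_reload.insert(key);
    }

    /// Called from reload_config_and_respawn (via self): take and clear the set.
    pub fn take_components_to_reload(&mut self) -> HashSet<ComponentKey> {
        std::mem::take(&mut self.components_to_reload)
    }
}
```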

@gllb (Contributor, Author) replied:

OK, I can create a follow-up PR for it, but I'm not sure I will have a lot of time; let's see. I still have no idea about the buffer-reuse struggle I pointed out in shutdown_diff, but I will open a PR with what I have so anybody can contribute as well.

@pront pront enabled auto-merge March 12, 2025 17:55
@pront (Member) commented Mar 12, 2025

A failed test prevented this from merging: https://github.com/vectordotdev/vector/actions/runs/13790530146/job/38656922831?pr=22539

auto-merge was automatically disabled March 13, 2025 09:30

Head branch was pushed to by a user without write access

@gllb (Contributor, Author) commented Mar 13, 2025

Hi @pront, hopefully the last fix I pushed solves it. I struggled a bit to run the tests locally as they are very demanding on memory.

@pront (Member) commented Mar 13, 2025

@gllb thank you for the fast iterations on this. If possible, avoid force pushing so I can do incremental reviews. I approved the workflows again.

@pront (Member) commented Mar 13, 2025

Test failures persist. You can iterate faster locally with cargo test.

@gllb (Contributor, Author) commented Mar 14, 2025

@pront, cargo test consumes too much memory locally and gets killed; reducing the number of jobs to 1 (-j 1) or scoping to a single test unfortunately leads to the same result. If you have any clue how to reduce this memory consumption, please let me know.

Edit: I managed to find options in nextest to reduce the number of build and test jobs, so the changed tests now pass successfully.

@gllb gllb requested a review from pront March 17, 2025 14:27
```diff
@@ -269,8 +268,23 @@ mod tests {
        {
            panic!("Test timed out");
        }
    }

    #[tokio::test]
```
@pront (Member) commented:

Thank you for adding a test and fixing the failure. I realize that these tests have several pre-existing problems (not introduced by this PR):

  • they are disabled on macOS
  • they use arbitrary delays, e.g. 3 seconds and then 3*5
  • they panic instead of returning an error

But again, this is out of scope for this PR.

@pront pront enabled auto-merge March 17, 2025 16:30
@pront pront added this pull request to the merge queue Mar 17, 2025
Merged via the queue into vectordotdev:master with commit 5e392ad Mar 17, 2025
85 checks passed