-
Notifications
You must be signed in to change notification settings - Fork 517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
testsys: Add --wait
flag for test runs
#3221
Conversation
tools/testsys/src/run.rs
Outdated
#[clap( | ||
long, | ||
env = "TESTSYS_OUTPUT_PATH", | ||
parse(from_os_str), | ||
requires = "wait" | ||
)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the output-path
flag tied to the wait
flag?
Can this instead be tied to the testsys status
command?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result of the tests created by the invocation are output to this output-path. We should also add the same feature to status, but in a separate pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pr is enabling automation to call cargo make -e TESTSYS_WAIT=true -e TESTSYS_OUTPUT_PATH=foo.json test
and then immediately use jq
on foo.json
to get the results of the test without having to deal with TestSys itself.
tools/testsys/src/wait.rs
Outdated
info!( | ||
"Still waiting for resource '{resource_name}' to be deleted from the cluster. Sleeping 60s" | ||
); | ||
tokio::time::sleep(Duration::from_secs(60)).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
60 seconds between polls seem a bit long to me. Any reasoning behind this?
Can we capture the duration of the wait in a const can reuse it whereever it applies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
60 seconds was arbitrary, but I didn't want to go too short since it could be polling for hours, and 60 seconds didn't seem too bad. I'll bring it out as a constant anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd vote for something like 10 just so you don't waste time when you're doing a quick
run, i.e. if this becomes part of ones developer flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a configurable overall timeout, i.e. we might want the --wait
invocation to timeout after 3 hours instead of hanging indefinitely.
tools/testsys/src/wait.rs
Outdated
info!( | ||
"Still waiting for resource '{resource_name}' to be deleted from the cluster. Sleeping 60s" | ||
); | ||
tokio::time::sleep(Duration::from_secs(60)).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd vote for something like 10 just so you don't waste time when you're doing a quick
run, i.e. if this becomes part of ones developer flow.
tools/testsys/src/wait.rs
Outdated
} | ||
|
||
#[derive(Debug)] | ||
pub enum TestRunResults { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A plural noun implies some sort of collection.
pub enum TestRunResults { | |
pub enum TestRunResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a configurable overall timeout, i.e. we might want the --wait invocation to timeout after 3 hours instead of hanging indefinitely.
We definitely don't want a default overall timeout. A sonobuoy test that is waiting for another sonobuoy test to finish would certainly take more than 3 hours. I'll add a timeout for waiting for other completed resources to be deleted (10 mins). I will also add an optional overall timeout.
^
|
I'm wondering if TestSys should also stream the logs from the agent's it is watching? Maybe we could add that in a follow up pr. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the same timeout being used for these two things?
- Timeout on a long-running test
- Timeout on a resource creation error
What about timeout on a long-running resource creation or deletion?
Have you prevented --wait
from running for 12 hours when no progress is being made and no errors have been reported?
The timeout for a long running test is the responsibility of the test agent not the run command. A test could take 24 hours to run as its standard time and we shouldn't hard code something that prevents that. Each of the timeouts you are worried about are timeouts that should be implemented elsewhere (mainly the agent itself). Adding any sort of timeout will create a situation where a new test that may take longer than others doesn't work without changing the hard coded timeouts. If you wanted a 3 hour timeout, you can set it with |
My original idea, which still stands, is a timeout for the I also don't think that this command should attempt a cleanup of resources and tests when something bad happens. I don't think there should be this complex and hard to understand timeout window in which you can fix the problem. Instead I would expect the |
Is the configurable timeout acceptable? I am strongly against a default timeout of the The
There currently isn't anything being cleaned up after the test is complete. I have a local commit that will enable automatic clean up, but I think that belongs in a separate pr. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've asked @cbgbt to take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, it's best to rate-limit requests to external services. Retry loops should use exponential backoff with jitter.
I also agree with @webern that a default (though large) timeout is a wise move. Unbounded waiting on a tool intended to be used in CI can be painful for a variety of reasons.
debug!("Testing completed writing results"); | ||
if let Some(output) = self.output_path { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it seems odd to pair the output-path
with --wait
. I wonder if this rendering logic should live in a function that would be callable when we add the same capability to testsys status
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean.
info!("Successfully added '{}'", crd.name().unwrap()); | ||
} | ||
|
||
if self.wait { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is too large. I think it could be greatly helped by moving the variant creator above to some kind of Factory
-style separate method.
I think similarly abstracting out the wait_for_crds
call to one that takes timeout: Option<Duration>
and handles that separately would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we proceed without these changes and I'll create an issue to address this in a followup pr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we proceed without these changes and I'll create an issue to address this in a followup pr?
I would be Ok with that since the function is already too large before you added 40 lines to it. After refactor it should read more like:
let (foo, bar) = do_one_complex_thing().await?;
let baz = do_another_complex_thing(foo, bar).await?;
do_another_thing(foo, bar, baz).await?;
...etc
I have fixed most of the suggested changes and added a 3 hour default timeout that can be configured with |
Adds 2 new flags to `testsys run`. If `--wait` is used, TestSys will block and monitor the status of all crds created and report the results. `--output-path` can be used to write the test results in json form to a desired location.
@@ -73,6 +75,12 @@ pub enum Error { | |||
#[snafu(context(false), display("{}", source))] | |||
PubsysConfig { source: pubsys_config::Error }, | |||
|
|||
#[snafu(display("Resource '{}' failed to be created: {}", resource_name, error))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style nit:
#[snafu(display("Resource '{}' failed to be created: {}", resource_name, error))] | |
#[snafu(display("Failed to create resource '{}': {}", resource_name, error))] |
info!("Successfully added '{}'", crd.name().unwrap()); | ||
} | ||
|
||
if self.wait { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we proceed without these changes and I'll create an issue to address this in a followup pr?
I would be Ok with that since the function is already too large before you added 40 lines to it. After refactor it should read more like:
let (foo, bar) = do_one_complex_thing().await?;
let baz = do_another_complex_thing(foo, bar).await?;
do_another_thing(foo, bar, baz).await?;
...etc
debug!("Testing completed writing results"); | ||
if let Some(output) = self.output_path { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug!("Testing completed writing results"); | |
if let Some(output) = self.output_path { | |
debug!("Testing completed"); | |
if let Some(output) = self.output_path { | |
debug!("Writing test results"); |
Ok(results) | ||
} | ||
|
||
/// Wait until the conflicting resources are deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you define what it means for a resource to be conflicting, here, or somewhere? Maybe it exists somewhere already. I think it means a resource that can't be used by more than one test simultaneously? Or one that cannot be created until another is destroyed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is described when conflicting resources are identified. Conflicting resources are resources that must not exist before the resource creation can begin.
} | ||
|
||
impl TestRunResult { | ||
fn context(self) -> Result<HashMap<String, Vec<TestResults>>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would results(&self)
be a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to keep consistent with snafu, but I can change it if that's preferred.
This pr is in the wrong git repo now that testsys has moved to twoliter. Closing. |
Issue number:
Closes #
Description of changes:
Testing done:
Ran
cargo make test
withTESTSYS_WAIT=true
and tried settingTESTSYS_OUTPUT_PATH
to verify output.Sample output file:
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.