Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testsys: Add --wait flag for test runs #3221

Closed
wants to merge 1 commit into from

Conversation

ecpullen
Copy link
Contributor

Issue number:

Closes #

Description of changes:

testsys: Add `--wait` flag for test runs
    
    Adds 2 new flags to `testsys run`. If `--wait` is used, TestSys will
    block and monitor the status of all crds created and report the results.
    `--output-path` can be used to write the test results in json form to a
    desired location.

Testing done:

Ran cargo make test with TESTSYS_WAIT=true and tried setting TESTSYS_OUTPUT_PATH to verify output.

Sample output file:

{
  "k-test-6-quick": [
    {
      "outcome": "pass",
      "numPassed": 1,
      "numFailed": 0,
      "numSkipped": 6972,
      "otherInfo": ""
    }
  ]
}

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Comment on lines 124 to 134
#[clap(
long,
env = "TESTSYS_OUTPUT_PATH",
parse(from_os_str),
requires = "wait"
)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the output-path flag tied to the wait flag?

Can this instead be tied to the testsys status command?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result of the tests created by the invocation are output to this output-path. We should also add the same feature to status, but in a separate pr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pr is enabling automation to call cargo make -e TESTSYS_WAIT=true -e TESTSYS_OUTPUT_PATH=foo.json test and then immediately use jq on foo.json to get the results of the test without having to deal with TestSys itself.

info!(
"Still waiting for resource '{resource_name}' to be deleted from the cluster. Sleeping 60s"
);
tokio::time::sleep(Duration::from_secs(60)).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

60 seconds between polls seem a bit long to me. Any reasoning behind this?
Can we capture the duration of the wait in a const can reuse it whereever it applies?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

60 seconds was arbitrary, but I didn't want to go too short since it could be polling for hours, and 60 seconds didn't seem too bad. I'll bring it out as a constant anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote for something like 10 just so you don't waste time when you're doing a quick run, i.e. if this becomes part of ones developer flow.

Copy link
Contributor

@webern webern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a configurable overall timeout, i.e. we might want the --wait invocation to timeout after 3 hours instead of hanging indefinitely.

info!(
"Still waiting for resource '{resource_name}' to be deleted from the cluster. Sleeping 60s"
);
tokio::time::sleep(Duration::from_secs(60)).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote for something like 10 just so you don't waste time when you're doing a quick run, i.e. if this becomes part of ones developer flow.

}

#[derive(Debug)]
pub enum TestRunResults {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A plural noun implies some sort of collection.

Suggested change
pub enum TestRunResults {
pub enum TestRunResult {

Copy link
Contributor Author

@ecpullen ecpullen Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a configurable overall timeout, i.e. we might want the --wait invocation to timeout after 3 hours instead of hanging indefinitely.

We definitely don't want a default overall timeout. A sonobuoy test that is waiting for another sonobuoy test to finish would certainly take more than 3 hours. I'll add a timeout for waiting for other completed resources to be deleted (10 mins). I will also add an optional overall timeout.

@ecpullen
Copy link
Contributor Author

ecpullen commented Jun 26, 2023

^

  • Adds TESTYS_WAIT_TIMEOUT to optionally prevent hanging indefinitely
  • Adds a timeout (10 mins) for the user to fix a resource/test that is blocking test execution
  • Moves times to constants.
  • Reduced polling time to 10s from 60s
  • Renamed TestRunResults to TestRunResult

@ecpullen
Copy link
Contributor Author

I'm wondering if TestSys should also stream the logs from the agent's it is watching? Maybe we could add that in a follow up pr.

Copy link
Contributor

@webern webern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the same timeout being used for these two things?

  • Timeout on a long-running test
  • Timeout on a resource creation error

What about timeout on a long-running resource creation or deletion?

Have you prevented --wait from running for 12 hours when no progress is being made and no errors have been reported?

@ecpullen
Copy link
Contributor Author

ecpullen commented Jun 27, 2023

Is the same timeout being used for these two things?

  • Timeout on a long-running test
  • Timeout on a resource creation error

What about timeout on a long-running resource creation or deletion?

Have you prevented --wait from running for 12 hours when no progress is being made and no errors have been reported?

The timeout for a long running test is the responsibility of the test agent not the run command. A test could take 24 hours to run as its standard time and we shouldn't hard code something that prevents that. Each of the timeouts you are worried about are timeouts that should be implemented elsewhere (mainly the agent itself). Adding any sort of timeout will create a situation where a new test that may take longer than others doesn't work without changing the hard coded timeouts. If you wanted a 3 hour timeout, you can set it with TESTSYS_WAIT_TIMEOUT.

@webern
Copy link
Contributor

webern commented Jun 27, 2023

The timeout for a long running test is the responsibility of the test agent not the run command.

My original idea, which still stands, is a timeout for the --wait... not for the agents. In other words, if testsys is borked, the agent is hung and will never timeout, then I want some timeout on the CLI program that would otherwise wait for it forever. This way we won't have a CI run that hangs forever if testsys isn't behaving as expected.

I also don't think that this command should attempt a cleanup of resources and tests when something bad happens. I don't think there should be this complex and hard to understand timeout window in which you can fix the problem. Instead I would expect the --wait command to FAIL: 1 and leave the cluster in whatever state its in. I can then use that failure to decide whether I want to try to fix it or whether I want to delete all (and CI would likely testsys delete --all).

@ecpullen
Copy link
Contributor Author

The timeout for a long running test is the responsibility of the test agent not the run command.

My original idea, which still stands, is a timeout for the --wait... not for the agents. In other words, if testsys is borked, the agent is hung and will never timeout, then I want some timeout on the CLI program that would otherwise wait for it forever. This way we won't have a CI run that hangs forever if testsys isn't behaving as expected.

Is the configurable timeout acceptable? I am strongly against a default timeout of the --wait for the reasons above. With TESTSYS_WAIT_TIMEOUT the desired effect is reached, without potentially making the code unusable with new agents.

The --wait does have several smart timeouts (10 mins) already.

  • Conflicting resource errored
  • Conflicting resource won't be automatically cleaned up
  • A test is errored or failed and won't be automatically cleaned up

I also don't think that this command should attempt a cleanup of resources and tests when something bad happens. I don't think there should be this complex and hard to understand timeout window in which you can fix the problem. Instead I would expect the --wait command to FAIL: 1 and leave the cluster in whatever state it's in. I can then use that failure to decide whether I want to try to fix it or whether I want to delete all (and CI would likely testsys delete --all).

There currently isn't anything being cleaned up after the test is complete. I have a local commit that will enable automatic clean up, but I think that belongs in a separate pr.

Copy link
Contributor

@webern webern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've asked @cbgbt to take a look.

Copy link
Contributor

@cbgbt cbgbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, it's best to rate-limit requests to external services. Retry loops should use exponential backoff with jitter.

I also agree with @webern that a default (though large) timeout is a wise move. Unbounded waiting on a tool intended to be used in CI can be painful for a variety of reasons.

Comment on lines +474 to +472
debug!("Testing completed writing results");
if let Some(output) = self.output_path {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it seems odd to pair the output-path with --wait. I wonder if this rendering logic should live in a function that would be callable when we add the same capability to testsys status?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean.

info!("Successfully added '{}'", crd.name().unwrap());
}

if self.wait {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is too large. I think it could be greatly helped by moving the variant creator above to some kind of Factory-style separate method.

I think similarly abstracting out the wait_for_crds call to one that takes timeout: Option<Duration> and handles that separately would help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we proceed without these changes and I'll create an issue to address this in a followup pr?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we proceed without these changes and I'll create an issue to address this in a followup pr?

I would be Ok with that since the function is already too large before you added 40 lines to it. After refactor it should read more like:

let (foo, bar) = do_one_complex_thing().await?;
let baz = do_another_complex_thing(foo, bar).await?;
do_another_thing(foo, bar, baz).await?;
...etc

tools/testsys/src/wait.rs Outdated Show resolved Hide resolved
tools/testsys/src/wait.rs Outdated Show resolved Hide resolved
tools/testsys/src/wait.rs Outdated Show resolved Hide resolved
tools/testsys/src/wait.rs Outdated Show resolved Hide resolved
@ecpullen
Copy link
Contributor Author

I have fixed most of the suggested changes and added a 3 hour default timeout that can be configured with TESTSYS_WAIT_TIMEOUT.

Adds 2 new flags to `testsys run`. If `--wait` is used, TestSys will
block and monitor the status of all crds created and report the results.
`--output-path` can be used to write the test results in json form to a
desired location.
@@ -73,6 +75,12 @@ pub enum Error {
#[snafu(context(false), display("{}", source))]
PubsysConfig { source: pubsys_config::Error },

#[snafu(display("Resource '{}' failed to be created: {}", resource_name, error))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit:

Suggested change
#[snafu(display("Resource '{}' failed to be created: {}", resource_name, error))]
#[snafu(display("Failed to create resource '{}': {}", resource_name, error))]

info!("Successfully added '{}'", crd.name().unwrap());
}

if self.wait {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we proceed without these changes and I'll create an issue to address this in a followup pr?

I would be Ok with that since the function is already too large before you added 40 lines to it. After refactor it should read more like:

let (foo, bar) = do_one_complex_thing().await?;
let baz = do_another_complex_thing(foo, bar).await?;
do_another_thing(foo, bar, baz).await?;
...etc

Comment on lines +471 to +472
debug!("Testing completed writing results");
if let Some(output) = self.output_path {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
debug!("Testing completed writing results");
if let Some(output) = self.output_path {
debug!("Testing completed");
if let Some(output) = self.output_path {
debug!("Writing test results");

Ok(results)
}

/// Wait until the conflicting resources are deleted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you define what it means for a resource to be conflicting, here, or somewhere? Maybe it exists somewhere already. I think it means a resource that can't be used by more than one test simultaneously? Or one that cannot be created until another is destroyed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is described when conflicting resources are identified. Conflicting resources are resources that must not exist before the resource creation can begin.

}

impl TestRunResult {
fn context(self) -> Result<HashMap<String, Vec<TestResults>>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would results(&self) be a better name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to keep consistent with snafu, but I can change it if that's preferred.

@ecpullen
Copy link
Contributor Author

This pr is in the wrong git repo now that testsys has moved to twoliter. Closing.

@ecpullen ecpullen closed this Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants