testsys: Add `--wait` flag for test runs #3221

ecpullen · 2023-06-22T20:19:56Z

Issue number:

Closes #

Description of changes:

testsys: Add `--wait` flag for test runs
    
    Adds 2 new flags to `testsys run`. If `--wait` is used, TestSys will
    block and monitor the status of all crds created and report the results.
    `--output-path` can be used to write the test results in json form to a
    desired location.

Testing done:

Ran cargo make test with TESTSYS_WAIT=true and tried setting TESTSYS_OUTPUT_PATH to verify output.

Sample output file:

{
  "k-test-6-quick": [
    {
      "outcome": "pass",
      "numPassed": 1,
      "numFailed": 0,
      "numSkipped": 6972,
      "otherInfo": ""
    }
  ]
}

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

etungsten · 2023-06-22T22:59:21Z

tools/testsys/src/run.rs

+    #[clap(
+        long,
+        env = "TESTSYS_OUTPUT_PATH",
+        parse(from_os_str),
+        requires = "wait"
+    )]


Why is the output-path flag tied to the wait flag?

Can this instead be tied to the testsys status command?

The results of the tests created by the invocation are written to this output-path. We should also add the same feature to status, but in a separate PR.

This PR enables automation to call cargo make -e TESTSYS_WAIT=true -e TESTSYS_OUTPUT_PATH=foo.json test and then immediately use jq on foo.json to get the test results without having to deal with TestSys itself.
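
For automation that prefers to consume the results in code rather than with jq, the shape of the sample output file could be modeled roughly as below. This is only a sketch: the struct name `TestSummary`, the helper `failing_tests`, and the rule that any outcome other than "pass" counts as a failure are assumptions made for illustration, not part of this PR. Parsing the JSON itself is left out.

```rust
use std::collections::HashMap;

/// Hypothetical mirror of one entry in the sample output file. Field names
/// follow the JSON keys shown in the PR description (outcome, numPassed, ...).
#[derive(Debug, Clone, PartialEq)]
pub struct TestSummary {
    pub outcome: String,
    pub num_passed: u64,
    pub num_failed: u64,
    pub num_skipped: u64,
    pub other_info: String,
}

/// Returns the sorted names of tests with at least one non-passing result.
/// Assumption: anything other than outcome "pass" with zero failures is a failure.
pub fn failing_tests(results: &HashMap<String, Vec<TestSummary>>) -> Vec<String> {
    let mut failing: Vec<String> = results
        .iter()
        .filter(|(_, runs)| {
            runs.iter()
                .any(|run| run.outcome != "pass" || run.num_failed > 0)
        })
        .map(|(name, _)| name.clone())
        .collect();
    failing.sort();
    failing
}
```

A CI step could exit non-zero whenever `failing_tests` returns a non-empty list.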

etungsten · 2023-06-22T23:05:11Z

tools/testsys/src/wait.rs

+        info!(
+            "Still waiting for resource '{resource_name}' to be deleted from the cluster. Sleeping 60s"
+        );
+        tokio::time::sleep(Duration::from_secs(60)).await;


60 seconds between polls seems a bit long to me. Is there any reasoning behind this?
Can we capture the wait duration in a const and reuse it wherever it applies?

60 seconds was arbitrary, but I didn't want to go too short since it could be polling for hours, and 60 seconds didn't seem too bad. I'll pull it out into a constant anyway.

I'd vote for something like 10 just so you don't waste time when you're doing a quick run, i.e. if this becomes part of one's developer flow.

webern

There should be a configurable overall timeout, i.e. we might want the --wait invocation to timeout after 3 hours instead of hanging indefinitely.

webern · 2023-06-26T16:50:34Z

tools/testsys/src/wait.rs

+        info!(
+            "Still waiting for resource '{resource_name}' to be deleted from the cluster. Sleeping 60s"
+        );
+        tokio::time::sleep(Duration::from_secs(60)).await;


I'd vote for something like 10 just so you don't waste time when you're doing a quick run, i.e. if this becomes part of one's developer flow.

webern · 2023-06-26T16:51:52Z

tools/testsys/src/wait.rs

+}
+
+#[derive(Debug)]
+pub enum TestRunResults {


A plural noun implies some sort of collection.

Suggested change

pub enum TestRunResults {

pub enum TestRunResult {

There should be a configurable overall timeout, i.e. we might want the --wait invocation to timeout after 3 hours instead of hanging indefinitely.

We definitely don't want a default overall timeout. A sonobuoy test that is waiting for another sonobuoy test to finish would certainly take more than 3 hours. I'll add a timeout for waiting for other completed resources to be deleted (10 mins). I will also add an optional overall timeout.

ecpullen · 2023-06-26T20:17:19Z

^

Adds TESTSYS_WAIT_TIMEOUT to optionally prevent hanging indefinitely
Adds a timeout (10 mins) for the user to fix a resource/test that is blocking test execution
Moves the wait durations to constants
Reduces the polling interval from 60s to 10s
Renames TestRunResults to TestRunResult
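
As a rough illustration of the constants and the optional overall timeout listed above, a polling loop might look like the sketch below. It is synchronous and uses only the standard library, whereas the code under review uses tokio. The names `POLL_INTERVAL`, `BLOCKED_RESOURCE_TIMEOUT`, `WaitOutcome`, and `poll_until` are hypothetical and do not come from the PR.

```rust
use std::time::{Duration, Instant};

/// Assumed value from the discussion: poll every 10 seconds.
pub const POLL_INTERVAL: Duration = Duration::from_secs(10);
/// Assumed value from the discussion: 10 minutes to fix a blocking resource.
pub const BLOCKED_RESOURCE_TIMEOUT: Duration = Duration::from_secs(10 * 60);

#[derive(Debug, PartialEq)]
pub enum WaitOutcome {
    Done,
    TimedOut,
}

/// Calls `is_done` until it returns true or the optional `timeout` elapses.
/// `sleep` is injected so callers can use a real sleep in production and a
/// no-op in tests.
pub fn poll_until<F, S>(
    mut is_done: F,
    interval: Duration,
    timeout: Option<Duration>,
    mut sleep: S,
) -> WaitOutcome
where
    F: FnMut() -> bool,
    S: FnMut(Duration),
{
    let start = Instant::now();
    loop {
        if is_done() {
            return WaitOutcome::Done;
        }
        // With `None`, the loop waits indefinitely, as the original `--wait` did.
        if let Some(limit) = timeout {
            if start.elapsed() >= limit {
                return WaitOutcome::TimedOut;
            }
        }
        sleep(interval);
    }
}
```

Passing `std::thread::sleep` as the last argument gives real blocking behavior between polls.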

ecpullen · 2023-06-26T20:23:04Z

I'm wondering if TestSys should also stream the logs from the agents it is watching? Maybe we could add that in a follow-up PR.

webern

Is the same timeout being used for these two things?

Timeout on a long-running test
Timeout on a resource creation error

What about timeout on a long-running resource creation or deletion?

Have you prevented --wait from running for 12 hours when no progress is being made and no errors have been reported?

ecpullen · 2023-06-27T20:11:23Z

Is the same timeout being used for these two things?

Timeout on a long-running test

Timeout on a resource creation error

What about timeout on a long-running resource creation or deletion?

Have you prevented --wait from running for 12 hours when no progress is being made and no errors have been reported?

The timeout for a long running test is the responsibility of the test agent not the run command. A test could take 24 hours to run as its standard time and we shouldn't hard-code something that prevents that. Each of the timeouts you are worried about should be implemented elsewhere (mainly in the agent itself). Adding any sort of hard-coded timeout will create a situation where a new test that takes longer than others doesn't work without changing the hard-coded timeouts. If you want a 3-hour timeout, you can set it with TESTSYS_WAIT_TIMEOUT.

webern · 2023-06-27T23:19:18Z

The timeout for a long running test is the responsibility of the test agent not the run command.

My original idea, which still stands, is a timeout for the --wait... not for the agents. In other words, if testsys is borked, the agent is hung and will never time out, then I want some timeout on the CLI program that would otherwise wait for it forever. This way we won't have a CI run that hangs forever if testsys isn't behaving as expected.

I also don't think that this command should attempt a cleanup of resources and tests when something bad happens. I don't think there should be this complex and hard to understand timeout window in which you can fix the problem. Instead I would expect the --wait command to FAIL: 1 and leave the cluster in whatever state it's in. I can then use that failure to decide whether I want to try to fix it or whether I want to delete all (and CI would likely testsys delete --all).

ecpullen · 2023-06-27T23:32:13Z

The timeout for a long running test is the responsibility of the test agent not the run command.

My original idea, which still stands, is a timeout for the --wait... not for the agents. In other words, if testsys is borked, the agent is hung and will never time out, then I want some timeout on the CLI program that would otherwise wait for it forever. This way we won't have a CI run that hangs forever if testsys isn't behaving as expected.

Is the configurable timeout acceptable? I am strongly against a default timeout of the --wait for the reasons above. With TESTSYS_WAIT_TIMEOUT the desired effect is reached, without potentially making the code unusable with new agents.

The --wait does have several smart timeouts (10 mins) already.

Conflicting resource errored
Conflicting resource won't be automatically cleaned up
A test is errored or failed and won't be automatically cleaned up

I also don't think that this command should attempt a cleanup of resources and tests when something bad happens. I don't think there should be this complex and hard to understand timeout window in which you can fix the problem. Instead I would expect the --wait command to FAIL: 1 and leave the cluster in whatever state it's in. I can then use that failure to decide whether I want to try to fix it or whether I want to delete all (and CI would likely testsys delete --all).

There currently isn't anything being cleaned up after the test is complete. I have a local commit that will enable automatic cleanup, but I think that belongs in a separate PR.

webern

I've asked @cbgbt to take a look.

cbgbt

In general, it's best to rate-limit requests to external services. Retry loops should use exponential backoff with jitter.

I also agree with @webern that a default (though large) timeout is a wise move. Unbounded waiting on a tool intended to be used in CI can be painful for a variety of reasons.
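
To make the backoff suggestion concrete, here is a minimal sketch of a "full jitter" delay calculation. The function name `backoff_delay` and its parameters are hypothetical, and the random sample is passed in by the caller so the sketch stays free of external crates. It is not code from this PR.

```rust
use std::time::Duration;

/// Full-jitter exponential backoff: the delay is a random fraction of
/// min(cap, base * 2^attempt).
///
/// `unit_random` is expected to be a uniform sample in [0.0, 1.0] supplied by
/// the caller (for example from an RNG crate). Out-of-range values are
/// clamped and non-finite values are treated as 0.0.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration, unit_random: f64) -> Duration {
    // Saturate instead of overflowing for large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    let ceiling = base.checked_mul(factor).unwrap_or(cap).min(cap);
    let fraction = if unit_random.is_finite() {
        unit_random.clamp(0.0, 1.0)
    } else {
        0.0
    };
    ceiling.mul_f64(fraction)
}
```

The cap keeps a long-running wait from backing off to intervals so long that a completed test goes unnoticed, while the jitter spreads requests from concurrent runs.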

cbgbt · 2023-06-30T23:55:47Z

tools/testsys/src/run.rs

+            debug!("Testing completed writing results");
+            if let Some(output) = self.output_path {


I agree that it seems odd to pair the output-path with --wait. I wonder if this rendering logic should live in a function that would be callable when we add the same capability to testsys status?

I'm not sure what you mean.

cbgbt · 2023-06-30T23:56:51Z

tools/testsys/src/run.rs

            info!("Successfully added '{}'", crd.name().unwrap());
        }

+        if self.wait {


This method is too large. I think it could be greatly helped by moving the variant creator above to some kind of Factory-style separate method.

I think similarly abstracting out the wait_for_crds call to one that takes timeout: Option<Duration> and handles that separately would help.

Can we proceed without these changes and I'll create an issue to address this in a follow-up PR?

Can we proceed without these changes and I'll create an issue to address this in a follow-up PR?

I would be OK with that since the function was already too large before you added 40 lines to it. After a refactor it should read more like:

let (foo, bar) = do_one_complex_thing().await?;
let baz = do_another_complex_thing(foo, bar).await?;
do_another_thing(foo, bar, baz).await?;
...etc

ecpullen · 2023-07-17T20:20:08Z

I have addressed most of the suggested changes and added a 3-hour default timeout that can be configured with TESTSYS_WAIT_TIMEOUT.
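
One way the default and the override could fit together is sketched below. The PR does not state what format TESTSYS_WAIT_TIMEOUT takes, so treating it as a whole number of seconds is purely an assumption here, as are the names `DEFAULT_WAIT_TIMEOUT` and `resolve_wait_timeout`.

```rust
use std::num::ParseIntError;
use std::time::Duration;

/// Default overall wait of 3 hours, matching the comment above.
pub const DEFAULT_WAIT_TIMEOUT: Duration = Duration::from_secs(3 * 60 * 60);

/// Resolves the overall timeout from the raw value of TESTSYS_WAIT_TIMEOUT.
/// Assumption: the value is a whole number of seconds.
pub fn resolve_wait_timeout(raw: Option<&str>) -> Result<Duration, ParseIntError> {
    match raw {
        // Variable not set: fall back to the default.
        None => Ok(DEFAULT_WAIT_TIMEOUT),
        // Variable set: parse it, surfacing bad input as an error.
        Some(value) => value.trim().parse::<u64>().map(Duration::from_secs),
    }
}
```

A caller could pass `std::env::var("TESTSYS_WAIT_TIMEOUT").ok().as_deref()` as the argument.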

Adds 2 new flags to `testsys run`. If `--wait` is used, TestSys will block and monitor the status of all crds created and report the results. `--output-path` can be used to write the test results in json form to a desired location.

webern · 2023-07-24T17:32:49Z

tools/testsys/src/error.rs

@@ -73,6 +75,12 @@ pub enum Error {
    #[snafu(context(false), display("{}", source))]
    PubsysConfig { source: pubsys_config::Error },

+    #[snafu(display("Resource '{}' failed to be created: {}", resource_name, error))]


style nit:

Suggested change

#[snafu(display("Resource '{}' failed to be created: {}", resource_name, error))]

#[snafu(display("Failed to create resource '{}': {}", resource_name, error))]

webern · 2023-07-24T17:36:48Z

tools/testsys/src/run.rs

            info!("Successfully added '{}'", crd.name().unwrap());
        }

+        if self.wait {


Can we proceed without these changes and I'll create an issue to address this in a follow-up PR?

I would be OK with that since the function was already too large before you added 40 lines to it. After a refactor it should read more like:

let (foo, bar) = do_one_complex_thing().await?;
let baz = do_another_complex_thing(foo, bar).await?;
do_another_thing(foo, bar, baz).await?;
...etc

webern · 2023-07-24T17:38:14Z

tools/testsys/src/run.rs

+            debug!("Testing completed writing results");
+            if let Some(output) = self.output_path {


Suggested change

debug!("Testing completed writing results");

if let Some(output) = self.output_path {

debug!("Testing completed");

if let Some(output) = self.output_path {

debug!("Writing test results");

webern · 2023-07-24T17:43:31Z

tools/testsys/src/wait.rs

+    Ok(results)
+}
+
+/// Wait until the conflicting resources are deleted.


Can you define what it means for a resource to be conflicting, here, or somewhere? Maybe it exists somewhere already. I think it means a resource that can't be used by more than one test simultaneously? Or one that cannot be created until another is destroyed?

It is described when conflicting resources are identified. Conflicting resources are resources that must not exist before the resource creation can begin.

webern · 2023-07-24T17:47:48Z

tools/testsys/src/wait.rs

+}
+
+impl TestRunResult {
+    fn context(self) -> Result<HashMap<String, Vec<TestResults>>> {


Would results(&self) be a better name?

I was trying to stay consistent with snafu, but I can change it if that's preferred.

ecpullen · 2023-11-13T19:25:41Z

This PR is in the wrong git repo now that testsys has moved to twoliter. Closing.

ecpullen force-pushed the testsys-wait branch from 8406b42 to 281a79b Compare June 22, 2023 20:30

etungsten reviewed Jun 22, 2023

webern reviewed Jun 26, 2023

ecpullen force-pushed the testsys-wait branch from 281a79b to cf54d92 Compare June 26, 2023 20:13

ecpullen requested review from webern and etungsten June 26, 2023 20:17

ecpullen force-pushed the testsys-wait branch from cf54d92 to 9b4c672 Compare June 26, 2023 21:14

etungsten approved these changes Jun 26, 2023

webern reviewed Jun 27, 2023

webern suggested changes Jun 28, 2023

cbgbt requested changes Jul 1, 2023

ecpullen force-pushed the testsys-wait branch from 9b4c672 to db63cd8 Compare July 17, 2023 20:18

ecpullen requested review from webern and cbgbt July 17, 2023 20:19

testsys: Add --wait flag for test runs

813df97

Adds 2 new flags to `testsys run`. If `--wait` is used, TestSys will block and monitor the status of all crds created and report the results. `--output-path` can be used to write the test results in json form to a desired location.

ecpullen force-pushed the testsys-wait branch from db63cd8 to 813df97 Compare July 18, 2023 16:54

webern approved these changes Jul 24, 2023

cbgbt approved these changes Jul 25, 2023

ecpullen closed this Nov 13, 2023
