
Conductor: Add support for retries for failed commands #3864

Merged: 1 commit into GoogleCloudPlatform:master on Mar 6, 2025

Conversation

@barney-s (Collaborator) commented on Mar 5, 2025:

Add support for retries for failed commands.

  • Each command can cap the retry count using MaxRetries. This is useful for deterministic commands.
  • A global retry flag (default: 3) is used for all commands (see the sketch below).
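
A minimal sketch of how the per-command cap and the global flag could interact. Aside from MaxRetries, the field, flag, and function names here are assumptions for illustration, not the PR's actual code:

package main

import (
    "flag"
    "fmt"
)

// Global retry flag described in the PR (default 3 here; bumped to 10 later in the review).
var retries = flag.Int("retries", 3, "number of retries applied to all commands")

// CommandConfig is a hypothetical per-command config; only MaxRetries is named in the PR description.
type CommandConfig struct {
    Name       string
    MaxRetries int // optional per-command cap; 0 means "use the global flag"
}

// effectiveRetries caps the global retry count with the per-command maximum,
// so deterministic commands are not retried more often than MaxRetries.
func effectiveRetries(cfg CommandConfig) int {
    if cfg.MaxRetries > 0 && cfg.MaxRetries < *retries {
        return cfg.MaxRetries
    }
    return *retries
}

func main() {
    flag.Parse()
    fmt.Println(effectiveRetries(CommandConfig{Name: "mockgcp-gen"}))             // uses the global flag
    fmt.Println(effectiveRetries(CommandConfig{Name: "go-build", MaxRetries: 1})) // capped at 1
}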

barney-s changed the title from "Add support for retries for failed commands." to "Conductor: Add support for retries for failed commands" on Mar 5, 2025
@maqiuyujoyce (Collaborator) left a comment:

Thank you for adding the retries, @barney-s !

I have a few thoughts/questions:

  1. The place where I found retries are needed is when the communication with the LLM hangs or terminates unexpectedly. We talk to codebot for steps like script generation, API identification (right?), and mockgcp generation, so I'm thinking we may want to focus retries on those steps. In this PR, I didn't find a retry for creating scripts. As I shared, I had to go through 12 retries to finish generating scripts for 40 resources.
  2. I think with a max of 2 retries, the retry may not be that useful, especially since you set the default retry count to 3. I'd probably start the max retries at 10 or more, considering we are doing batch work.
  3. Have you tried to verify that the retry works for steps that use codebot for more than 50 resources?

@barney-s (Collaborator, Author) commented on Mar 5, 2025:

  1. The place where I found retries are needed is when the communication with the LLM hangs or terminates unexpectedly. We talk to codebot for steps like script generation, API identification (right?), and mockgcp generation, so I'm thinking we may want to focus retries on those steps. In this PR, I didn't find a retry for creating scripts. As I shared, I had to go through 12 retries to finish generating scripts for 40 resources.

Bumped the default retries to 10. For the LLM steps I have not set maxRetries, so they use the user-passed/default retry count.

  2. I think with a max of 2 retries, the retry may not be that useful, especially since you set the default retry count to 3. I'd probably start the max retries at 10 or more, considering we are doing batch work.

maxRetries is an optional, hardcoded per-command cap. The goal is to avoid retrying certain deterministic commands N times and instead cap them at maxRetries. Regarding your point that a cap of 2 is about as good as 1: I tend to agree; it just tries once more.

  3. Have you tried to verify that the retry works for steps that use codebot for more than 50 resources?

Have not yet. Will do and update here.
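
For context, a sketch of the retry loop implied by the exchange above: a failed command is re-run until it succeeds or the resolved retry budget is spent, sleeping for the configured backoff between attempts. The function and parameter names are illustrative assumptions, not the PR's code:

package main

import (
    "errors"
    "fmt"
    "time"
)

// runWithRetries re-runs fn until it succeeds or the retry budget is spent.
// retries and backoff come from whatever resolution the tool applies
// (global flag, optionally capped per command).
func runWithRetries(name string, retries int, backoff time.Duration, fn func() error) error {
    var err error
    for attempt := 0; attempt <= retries; attempt++ {
        if err = fn(); err == nil {
            return nil
        }
        if attempt < retries {
            fmt.Printf("[%s] attempt %d failed (%v); retrying in %s\n", name, attempt+1, err, backoff)
            time.Sleep(backoff)
        }
    }
    return fmt.Errorf("[%s] failed after %d attempts: %w", name, retries+1, err)
}

func main() {
    calls := 0
    err := runWithRetries("demo", 3, 100*time.Millisecond, func() error {
        calls++
        if calls < 3 {
            return errors.New("transient failure")
        }
        return nil
    })
    fmt.Println("result:", err)
}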

barney-s force-pushed the support-retry branch 3 times, most recently from a6323da to 9d90629, on March 6, 2025 00:54
@maqiuyujoyce (Collaborator) left a comment:

/lgtm
/approve

if cfg.Stdin != nil {
    log.Printf("[%s] stdin: %s", cfg.Name, cfg.Stdin)
}
if cfg.RetryBackoff == 0 {
    cfg.RetryBackoff = time.Second // default the backoff to one second when unset
}
Collaborator commented on this diff:

Nit: If this is related to quota then I guess one second of backoff time may not be sufficient. But we can improve it later.

@barney-s (Collaborator, Author) replied on Mar 6, 2025:

Changed it to 60s for generative commands.
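
A small sketch of the defaulting this thread converges on: one second unless a backoff is set, and 60 seconds for generative (LLM-backed) commands. The Generative field here is an assumption used only to mark such commands; the real code may distinguish them differently:

package main

import (
    "fmt"
    "time"
)

// CommandConfig mirrors the fields visible in the diff excerpt above,
// plus a hypothetical Generative marker for LLM-backed steps.
type CommandConfig struct {
    Name         string
    Generative   bool
    RetryBackoff time.Duration
}

// defaultBackoff fills in RetryBackoff only when it is unset: 60 seconds
// for generative commands, one second otherwise.
func defaultBackoff(cfg *CommandConfig) {
    if cfg.RetryBackoff != 0 {
        return
    }
    if cfg.Generative {
        cfg.RetryBackoff = 60 * time.Second
    } else {
        cfg.RetryBackoff = time.Second
    }
}

func main() {
    gen := CommandConfig{Name: "generate-script", Generative: true}
    det := CommandConfig{Name: "go-test"}
    defaultBackoff(&gen)
    defaultBackoff(&det)
    fmt.Println(gen.RetryBackoff, det.RetryBackoff) // 1m0s 1s
}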

google-oss-prow bot (Contributor) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: maqiuyujoyce

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Commit description:
- Each command can cap the retry count using MaxRetries. This is useful for deterministic commands.
- A global retry flag (default: 10) is used for all commands.
- Generative commands have a retry backoff of 60s.
@anhdle-sso (Collaborator) commented:

/lgtm

google-oss-prow bot added the lgtm label on Mar 6, 2025
google-oss-prow bot merged commit 8e818a1 into GoogleCloudPlatform:master on Mar 6, 2025 (4 checks passed)