
Conductor: Add support for retries for failed commands #3864

Merged: 1 commit into GoogleCloudPlatform:master on Mar 6, 2025

Conversation

@barney-s (Collaborator) commented on Mar 5, 2025:

Add support for retries for failed commands.

  • Each command can cap the retry count using MaxRetries. This is useful for deterministic commands.
  • A global retry flag (default: 3) is used for all commands (see the sketch below).
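
A minimal sketch of how the per-command cap and the global flag could interact. Aside from MaxRetries, the field, flag, and function names here are assumptions for illustration, not the PR's actual code:

package main

import (
    "flag"
    "fmt"
)

// Global retry flag described in the PR (default 3 here; bumped to 10 later in the review).
var retries = flag.Int("retries", 3, "number of retries applied to all commands")

// CommandConfig is a hypothetical per-command config; only MaxRetries is named in the PR description.
type CommandConfig struct {
    Name       string
    MaxRetries int // optional per-command cap; 0 means "use the global flag"
}

// effectiveRetries caps the global retry count with the per-command maximum,
// so deterministic commands are not retried more often than MaxRetries.
func effectiveRetries(cfg CommandConfig) int {
    if cfg.MaxRetries > 0 && cfg.MaxRetries < *retries {
        return cfg.MaxRetries
    }
    return *retries
}

func main() {
    flag.Parse()
    fmt.Println(effectiveRetries(CommandConfig{Name: "mockgcp-gen"}))             // uses the global flag
    fmt.Println(effectiveRetries(CommandConfig{Name: "go-build", MaxRetries: 1})) // capped at 1
}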

barney-s changed the title from "Add support for retries for failed commands." to "Conductor: Add support for retries for failed commands" on Mar 5, 2025
@maqiuyujoyce (Collaborator) left a comment:

Thank you for adding the retries, @barney-s !

I have a few thoughts/questions:

  1. The place where I found retries are needed is when the communication with the LLM hangs or terminates unexpectedly. We talk to codebot for steps like script generation, API identification (right?), and mockgcp generation, so I'm thinking we may want to focus retries on those steps. In this PR, I didn't find a retry for creating scripts. As I shared, I had to go through 12 retries to finish generating scripts for 40 resources.
  2. I think with a max of 2 retries, the retry may not be that useful, especially since you set the default retry count to 3. I'd probably start the max retries at 10 or more, considering we are doing batch work.
  3. Have you tried to verify that the retry works for steps that use codebot for more than 50 resources?

@barney-s (Collaborator, Author) commented on Mar 5, 2025:

  1. The place where I found retries are needed is when the communication with the LLM hangs or terminates unexpectedly. We talk to codebot for steps like script generation, API identification (right?), and mockgcp generation, so I'm thinking we may want to focus retries on those steps. In this PR, I didn't find a retry for creating scripts. As I shared, I had to go through 12 retries to finish generating scripts for 40 resources.

Bumped the default retries to 10. For the LLM steps I have not set maxRetries, so they use the user-passed/default retry count.

  2. I think with a max of 2 retries, the retry may not be that useful, especially since you set the default retry count to 3. I'd probably start the max retries at 10 or more, considering we are doing batch work.

maxRetries is an optional, hardcoded per-command cap. The goal is to avoid retrying certain deterministic commands N times and instead cap them at maxRetries. Regarding your point that a cap of 2 is about as good as 1: I tend to agree; it just tries once more.

  3. Have you tried to verify that the retry works for steps that use codebot for more than 50 resources?

Have not yet. Will do and update here.
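
For context, a sketch of the retry loop implied by the exchange above: a failed command is re-run until it succeeds or the resolved retry budget is spent, sleeping for the configured backoff between attempts. The function and parameter names are illustrative assumptions, not the PR's code:

package main

import (
    "errors"
    "fmt"
    "time"
)

// runWithRetries re-runs fn until it succeeds or the retry budget is spent.
// retries and backoff come from whatever resolution the tool applies
// (global flag, optionally capped per command).
func runWithRetries(name string, retries int, backoff time.Duration, fn func() error) error {
    var err error
    for attempt := 0; attempt <= retries; attempt++ {
        if err = fn(); err == nil {
            return nil
        }
        if attempt < retries {
            fmt.Printf("[%s] attempt %d failed (%v); retrying in %s\n", name, attempt+1, err, backoff)
            time.Sleep(backoff)
        }
    }
    return fmt.Errorf("[%s] failed after %d attempts: %w", name, retries+1, err)
}

func main() {
    calls := 0
    err := runWithRetries("demo", 3, 100*time.Millisecond, func() error {
        calls++
        if calls < 3 {
            return errors.New("transient failure")
        }
        return nil
    })
    fmt.Println("result:", err)
}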

barney-s force-pushed the support-retry branch 3 times, most recently from a6323da to 9d90629, on March 6, 2025 00:54
@maqiuyujoyce (Collaborator) left a comment:

/lgtm
/approve

if cfg.Stdin != nil {
    log.Printf("[%s] stdin: %s", cfg.Name, cfg.Stdin)
}
if cfg.RetryBackoff == 0 {
    cfg.RetryBackoff = time.Second // default the backoff to one second when unset
}
Collaborator commented on this diff:

Nit: If this is related to quota then I guess one second of backoff time may not be sufficient. But we can improve it later.

@barney-s (Collaborator, Author) replied on Mar 6, 2025:

Changed it to 60s for generative commands.
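
A small sketch of the defaulting this thread converges on: one second unless a backoff is set, and 60 seconds for generative (LLM-backed) commands. The Generative field here is an assumption used only to mark such commands; the real code may distinguish them differently:

package main

import (
    "fmt"
    "time"
)

// CommandConfig mirrors the fields visible in the diff excerpt above,
// plus a hypothetical Generative marker for LLM-backed steps.
type CommandConfig struct {
    Name         string
    Generative   bool
    RetryBackoff time.Duration
}

// defaultBackoff fills in RetryBackoff only when it is unset: 60 seconds
// for generative commands, one second otherwise.
func defaultBackoff(cfg *CommandConfig) {
    if cfg.RetryBackoff != 0 {
        return
    }
    if cfg.Generative {
        cfg.RetryBackoff = 60 * time.Second
    } else {
        cfg.RetryBackoff = time.Second
    }
}

func main() {
    gen := CommandConfig{Name: "generate-script", Generative: true}
    det := CommandConfig{Name: "go-test"}
    defaultBackoff(&gen)
    defaultBackoff(&det)
    fmt.Println(gen.RetryBackoff, det.RetryBackoff) // 1m0s 1s
}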

google-oss-prow bot (Contributor) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: maqiuyujoyce

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Commit description:
- Each command can cap the retry count using MaxRetries. This is useful for deterministic commands.
- A global retry flag (default: 10) is used for all commands.
- Generative commands have a retry backoff of 60s.
@anhdle-sso (Collaborator) commented:

/lgtm

google-oss-prow bot added the lgtm label on Mar 6, 2025
google-oss-prow bot merged commit 8e818a1 into GoogleCloudPlatform:master on Mar 6, 2025 (4 checks passed)