Conductor: Add support for retries for failed commands #3864
Conversation
Thank you for adding the retries, @barney-s !
I have a few thoughts/questions:
- The place where I found retries are needed is when communication with the LLM hangs or terminates unexpectedly. We talk to codebot for steps like script generation, API identification (right?), and mockgcp generation, so I'm thinking we may want to focus retries on those steps. In this PR I didn't find a retry for creating scripts; as I shared, I had to go through 12 retries to finish generating scripts for 40 resources.
- I think with a max retries of 2 the retry may not be that useful, especially since you set the default retries to 3. I'd probably start the max retries at 10 or more, considering we are doing batch work.
- Have you tried to verify that the retry works for steps that use codebot for more than 50 resources?
Bumped the default retries to 10. For LLM steps I have not set maxRetries, so those use the user-passed/default retry count.
maxRetries is an optional, hardcoded per-command cap. The goal is to avoid retrying certain deterministic commands N times and instead cap them at maxRetries. Regarding your point that a cap of 2 is about as good as 1: I tend to agree, it just tries once more.
Have not yet. Will do and update here.
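For illustration, here is a minimal Go sketch of the capping behaviour described above; the helper name and the "zero means unset" convention are assumptions, not code from this PR:

package conductor

// effectiveRetries is a hypothetical helper illustrating the behaviour above:
// a command uses the user-passed/default retry count unless a per-command
// MaxRetries is set, in which case the count is capped at that value.
func effectiveRetries(globalRetries, maxRetries int) int {
    // maxRetries == 0 means "no per-command cap"; use the global value.
    if maxRetries > 0 && maxRetries < globalRetries {
        return maxRetries // deterministic commands can opt into a lower cap
    }
    return globalRetries // LLM steps leave MaxRetries unset and get the full count
}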
a6323da to 9d90629
/lgtm
/approve
if cfg.Stdin != nil {
    log.Printf("[%s] stdin: %s", cfg.Name, cfg.Stdin)
}
// Default the retry backoff to one second when not configured.
if cfg.RetryBackoff == 0 {
    cfg.RetryBackoff = time.Second
}
Nit: If this is related to quota then I guess one second of backoff time may not be sufficient. But we can improve it later.
Changed it to 60s for generative commands.
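A minimal sketch of how that backoff choice could look, assuming a boolean "generative" flag and the 1s/60s defaults mentioned in this thread; the names are illustrative, not the PR's actual fields:

package conductor

import "time"

// Hypothetical defaults mirroring the discussion above: generative (LLM-backed)
// commands wait longer between attempts than deterministic ones.
const (
    defaultBackoff    = 1 * time.Second
    generativeBackoff = 60 * time.Second
)

// backoffFor returns the backoff for a command, honouring an explicit value first.
func backoffFor(generative bool, configured time.Duration) time.Duration {
    if configured != 0 {
        return configured
    }
    if generative {
        return generativeBackoff
    }
    return defaultBackoff
}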
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: maqiuyujoyce
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
- Each command can cap the retry count using MaxRetries. This is useful for deterministic commands.
- The global retry flag (default: 10) is used for all commands.
- Generative commands have a retry backoff of 60s (see the sketch below).
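To tie these bullets together, here is a hedged Go sketch of what such a retry loop might look like; runWithRetries and its signature are assumptions for illustration, not the conductor's real API:

package conductor

import (
    "fmt"
    "time"
)

// runWithRetries is an illustrative retry loop, not this PR's actual implementation:
// it runs a command once and retries up to `retries` more times on failure,
// sleeping `backoff` between attempts (e.g. 60s for generative commands).
func runWithRetries(run func() error, retries int, backoff time.Duration) error {
    var err error
    for attempt := 0; attempt <= retries; attempt++ {
        if err = run(); err == nil {
            return nil
        }
        if attempt < retries {
            time.Sleep(backoff)
        }
    }
    return fmt.Errorf("command failed after %d retries: %w", retries, err)
}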
/lgtm
Merged 8e818a1 into GoogleCloudPlatform:master
Add support for retries for failed commands.