-
Notifications
You must be signed in to change notification settings - Fork 617
Consistent timeout parameter #3159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistent timeout parameter #3159
Conversation
driver/docker-container/driver.go
Outdated
| case <-ctx.Done(): | ||
| return context.Cause(ctx) | ||
| case <-time.After(time.Duration(try*120) * time.Millisecond): | ||
| case <-time.After(d.Timeout / time.Microsecond): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it looks like before it did a linear backoff with step=120ms:
try 1: After 120ms
try2: After 240ms
…
try 15: After 1,800ms
total: 14,400ms or 14.4s
if the new Timeout is the total time and we want to keep the linear backoff then I think the math should be:
total = 15*(1*step+15*step)/2
(total*2)/(15*16) = step
I implemented somethng like that.
| Use: in.use, | ||
| Endpoint: ep, | ||
| Append: in.actionAppend, | ||
| Timeout: in.timeout, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it intentional that this is passed as a value rather than added to ctx? So this Create() call can still take more time than timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressing how the timeout is used:
In builder.Create that this builder.CreateOpts struct is passed to, the context.WithTimeoutCause is created using the opts.Timeout and the opts there is this builder.CreateOpts we are passing to builder.Create in the last argument here. So this Timeout is the one that is used.
Addressing where the Timeout comes from:
The in on this line comes from the createOptions struct passed to the runCreate function we are in, and that is populated in the createCmd function that calls us here. The options variable passed into runCreate has its timeout field populated from the rootOpts struct's timeout field before being passed into the function. The rootOpts var seems to be populated in the rootFlags function, including the timeout.
As for passing the timeout as a field in CreateOpts vs adding it to ctx, I don't know for sure why it was done that way, but I'd guess because nothing else in runCreate modifies the ctx. Not being much of a go user, I will take your word that the Create call can take more time than timeout but the slow part of runCreate seems bounded by the timeout just like before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just add it to the context as with the other commands if we don't know the reason why this one should be different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm a bit confused what you mean. Upon closer inspection the rm, ls, inspect, and create commands all do it this way, vs the build command that's different. The build command gets the timeout from the control.ControlOptions inclusion in the buildOptions struct, but doesn't seem to use it anywhere, possibly because builds tend to take absolute ages?
As a non go user what does, "add it to the context" mean, and in what way is that different between the create command and the rm, ls, or inspect commands?
| Platforms: n.Platforms, | ||
| ContextPathHash: b.opts.contextPathHash, | ||
| DialMeta: lno.dialMeta, | ||
| Timeout: defaultDriverTimeout, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the driver timeout is still constant 120s and can't be changed, wouldn't it be safer if this would be added in the driver pkg. Atm I think you could still theoretically call driver pkg forgetting to pass in InitConfig.Timeout and then possibly bad things can happen as it doesn't seem to be validated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check in GetDriver that the Timeout is set to something > 0.
| return nil, err | ||
| } | ||
| d.defaultLoad = parsed | ||
| case "timeout": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this specific for remote driver?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the driver/kubernetes/factory.go New function the option parsing is done in processDriverOpts, the driver/docker-container/factory.go New function missing this option parsing seems like an oversight possibly, I added the timeout option parsing there too, the driver/docker/factory.go New function doesn't seem to do any option parsing at all so not doing any timeout parsing there is consistent with the existing code but I don't know if it is okay.
driver/docker-container/driver.go
Outdated
| try := 1 | ||
| var try uint64 = 1 | ||
| var maxTries uint64 = 15 | ||
| var step uint64 = max((uint64(d.Timeout.Abs().Milliseconds()) * 2) / (maxTries * (1 + maxTries)), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this construction. Why Abs(), why Milliseconds() , why maxTries ** 2?
I don't think this really makes much difference what the step is. If you set the timeout too short it will error anyway and checking more frequently than 120ms doesn't make it faster but might just increase overhead. Maybe better is to just remember how much time we have spent waiting on each iteration and if the last time.After(time.Duration(try * step) goes over Timeout then just clip it to smth like Timeout - 1 second so that we give the waiter the best chance to succeed before the timeout is reached.
My original comment was for division by time.Microsecond that didn't make any logical sense to me. I still see that in k8s driver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why Abs(), why Milliseconds() , why maxTries ** 2?
Abs() because duration is signed and the math doesn't work if the duration is negative, it's just a defensive coding thing.
Milliseconds() because I am working in milliseconds, like the original code before the PR did anything.
maxtries*(maxtries+1) is just part of the math for getting the sum of a number of elements in an arithmetic sequence, like the old timeouts (120ms, 240ms, 360ms, etc).
I just solved for the step size given a total duration, where before the step size was fixed and the total duration was implied (14,400ms). I could keep the step size the same and rather cap the total duration and not control for the number of tries, that's definitely an option.
I replaced the use of time.Microsecond in the k8s driver with /1000.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to altering the number of tries instead of the step size
| Use: in.use, | ||
| Endpoint: ep, | ||
| Append: in.actionAppend, | ||
| Timeout: in.timeout, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just add it to the context as with the other commands if we don't know the reason why this one should be different.
tonistiigi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After you are done and CI is green, you can also squash the commits.
Signed-off-by: avoidik <[email protected]> Signed-off-by: Viacheslav Vasilyev <[email protected]>
Signed-off-by: Viacheslav Vasilyev <[email protected]>
Signed-off-by: Viacheslav Vasilyev <[email protected]>
Signed-off-by: Viacheslav Vasilyev <[email protected]>
change try count not step size sill control for total time
b6d2f1c to
4409317
Compare
4409317 to
a972eb1
Compare
|
if you don't mind, I will try again ;) |
I rebased #2586 off master, and am willing to try to fix any issues that remain if they are mentioned here.