
Conversation

@CamJN commented Apr 26, 2025:

I rebased #2586 off master, and am willing to try to fix any issues that remain if they are mentioned here.

```go
case <-ctx.Done():
	return context.Cause(ctx)
case <-time.After(time.Duration(try*120) * time.Millisecond):
case <-time.After(d.Timeout / time.Microsecond):
```

Author (@CamJN) commented Apr 30, 2025:

So it looks like the old code did a linear backoff with step = 120ms:

try 1: wait 120ms
try 2: wait 240ms
…
try 15: wait 1,800ms

total: 14,400ms, or 14.4s

If the new Timeout is the total time and we want to keep the linear backoff, then I think the math should be:

total = 15*(1*step + 15*step)/2
step = (total*2)/(15*16)

I implemented something like that.

```go
Use:      in.use,
Endpoint: ep,
Append:   in.actionAppend,
Timeout:  in.timeout,
```
Member:

Is it intentional that this is passed as a value rather than added to ctx? As written, this Create() call can still take more time than the timeout.

Author (@CamJN):

Addressing how the timeout is used: builder.Create, which this builder.CreateOpts struct is passed to, creates its context.WithTimeoutCause using opts.Timeout, and that opts is the builder.CreateOpts we pass to builder.Create as the last argument here. So this Timeout is the one that gets used.

Addressing where the Timeout comes from: the in on this line is the createOptions struct passed to the runCreate function we are in, which is populated in the createCmd function that calls us here. The options variable passed into runCreate has its timeout field populated from the rootOpts struct's timeout field before being passed in, and the rootOpts var seems to be populated in the rootFlags function, including the timeout.

As for passing the timeout as a field in CreateOpts vs. adding it to ctx, I don't know for sure why it was done that way, but I'd guess it's because nothing else in runCreate modifies the ctx. Not being much of a Go user, I will take your word that the Create call can take more time than the timeout, but the slow part of runCreate seems bounded by the timeout just like before?

Member:

I'd just add it to the context as with the other commands if we don't know the reason why this one should be different.

Author (@CamJN):

Actually, I'm a bit confused about what you mean. On closer inspection, the rm, ls, inspect, and create commands all do it this way; it's the build command that's different. The build command gets the timeout from the control.ControlOptions embedded in the buildOptions struct, but doesn't seem to use it anywhere, possibly because builds tend to take absolute ages?

As a non-Go user, what does "add it to the context" mean, and in what way is that different between the create command and the rm, ls, or inspect commands?

```go
Platforms:       n.Platforms,
ContextPathHash: b.opts.contextPathHash,
DialMeta:        lno.dialMeta,
Timeout:         defaultDriverTimeout,
```
Member:

If the driver timeout is still a constant 120s and can't be changed, wouldn't it be safer to add this in the driver pkg? At the moment I think you could still theoretically call the driver pkg while forgetting to pass InitConfig.Timeout, and then possibly bad things can happen, as it doesn't seem to be validated.

Author (@CamJN):

I added a check in GetDriver that the Timeout is set to something > 0.

```go
	return nil, err
}
d.defaultLoad = parsed
case "timeout":
```
Member:

Why is this specific for remote driver?

Author (@CamJN):

In the driver/kubernetes/factory.go New function, option parsing is done in processDriverOpts. The driver/docker-container/factory.go New function missing this option parsing seems like a possible oversight, so I added the timeout option parsing there too. The driver/docker/factory.go New function doesn't seem to do any option parsing at all, so not doing any timeout parsing there is consistent with the existing code, but I don't know if that is okay.

```go
try := 1
var try uint64 = 1
var maxTries uint64 = 15
var step uint64 = max((uint64(d.Timeout.Abs().Milliseconds())*2)/(maxTries*(1+maxTries)), 1)
```
Member (@tonistiigi) commented May 1, 2025:

I don't understand this construction. Why Abs()? Why Milliseconds()? Why maxTries ** 2?

I don't think it really makes much difference what the step is. If you set the timeout too short it will error anyway, and checking more frequently than every 120ms doesn't make it faster but might just increase overhead. Maybe better to just remember how much time we have spent waiting on each iteration, and if the last time.After(time.Duration(try*step)) goes over Timeout, clip it to something like Timeout - 1 second so that we give the waiter the best chance to succeed before the timeout is reached.

My original comment was about the division by time.Microsecond, which didn't make any logical sense to me. I still see that in the k8s driver.

Author (@CamJN):

Why Abs()? Why Milliseconds()? Why maxTries ** 2?

Abs() because Duration is signed and the math doesn't work if the duration is negative; it's just defensive coding.

Milliseconds() because I am working in milliseconds, like the original code before the PR did anything.

maxTries*(maxTries+1) is just part of the formula for the sum of the terms of an arithmetic sequence, like the old timeouts (120ms, 240ms, 360ms, etc.).

I just solved for the step size given a total duration, where before the step size was fixed and the total duration was implied (14,400ms). I could instead keep the step size the same, cap the total duration, and not control for the number of tries; that's definitely an option.

I replaced the use of time.Microsecond in the k8s driver with /1000.

Author (@CamJN):

I switched to altering the number of tries instead of the step size.


Member (@tonistiigi) left a review comment:

After you are done and CI is green, you can also squash the commits.

@CamJN force-pushed the pullrequests/consistent-timeout-parameter branch from b6d2f1c to 4409317 (May 1, 2025 06:25)
@CamJN force-pushed the pullrequests/consistent-timeout-parameter branch from 4409317 to a972eb1 (May 1, 2025 06:46)
@CamJN closed this May 1, 2025
@avoidik commented Jan 10, 2026:

if you don't mind, I will try again ;)
