Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for the goroutines to complete in the MultiService runner #65

Open
mna opened this issue Jan 17, 2023 · 5 comments
Open

Wait for the goroutines to complete in the MultiService runner #65

mna opened this issue Jan 17, 2023 · 5 comments

Comments

@mna
Copy link

mna commented Jan 17, 2023

Hello,

This is not technically an error, but the MultiService implementation currently returns as soon as the first service has run and the parallel goroutines for the other ones have started. The problem is that if this happens e.g. in a web server handling an MDM request, the request might return before all services have executed, and subsequent requests might not yet see the results of those services. That can be a surprising (and hard to identify/debug) behaviour.

In our nanomdm fork at Fleet, we merged a PR to ensure the MultiService does not return until all services have been executed. It maintains the same behaviour and optimization, in that it returns the results of the first service registered and runs all others in parallel, but it waits for the completion instead of leaving the goroutines running unattended. You can see it here: https://github.com/fleetdm/nanomdm/pull/4/files

If that's something you'd be interested in having upstream, I'd be happy to contribute the PR.

Thanks,
Martin

@korylprince
Copy link

The current behavior is obviously faster, and I would imagine wouldn't run into any issues in some scenarios, but your use case is obviously valid as well. I wonder if you couldn't add a default false bool to MultiService that allows turning on this behavior:

type MultiService struct {
	logger  log.Logger
	svcs    []service.CheckinAndCommandService
	ctx     context.Context
	WaitAll bool
}

func (ms *MultiService) runOthers(ctx context.Context, r errorRunner) {
	var wg *sync.WaitGroup
	if ms.WaitAll {
		wg = new(sync.WaitGroup)
	}
	for i, svc := range ms.svcs[1:] {
		go func(n int, s service.CheckinAndCommandService) {
			if ms.WaitAll {
				defer wg.Done()
			}
			err := r(s)
			if err != nil {
				ctxlog.Logger(ctx, ms.logger).Info(
					"sub_service", n,
					"err", err,
				)
			}
		}(i+1, svc)
	}
	if ms.WaitAll {
		wg.Wait()
	}
}

@jessepeterson
Copy link
Member

tl;dr: I'm not in principal against this, especially with a switch to turn it on and off as @korylprince suggested. But:

The problem is that if this happens e.g. in a web server handling an MDM request, the request might return before all services have executed, and subsequent requests might not yet see the results of those services.

NanoMDM itself doesn't have any dependencies on ordering of services other than it's own service. Can you give an example of a service or request that needs this sort of waiting? I'd love to learn more about what specifically is not working for you here.

There is only one such example I can think of and that's the TokenUpdate tally (which tracks how many TokenUpdates an enrollment has seen). We send that out on the web hook so it is looked up immediately after the service has stored it. Unfortunately because we bifurcate the context in the multi-service we can't store it on the request context, either (sadly) — but that'd be the ideal place to put it in this example.

Anyway interested to hear more!

@jessepeterson
Copy link
Member

Also: an alternative could be an entirely different MultiService that waited for all services. If you had a waiting multi server you could actually re-use the context which has some benefits!

@mna
Copy link
Author

mna commented Jan 18, 2023

Can you give an example of a service or request that needs this sort of waiting? I'd love to learn more about what specifically is not working for you here.

Sure, in our case we need to store some information in our tables in addition to the standard service, so e.g. on Authenticate, it runs the standard authenticate service, and then a service that inserts the device's information in other application-specific tables. There are a number of ways we could do this, of course, but using the multi-service was interesting since we receive the decoded parameters as arguments.

NanoMDM itself doesn't have any dependencies on ordering of services other than it's own service.

Note that it's still the case for the waiting scenario - there is no ordering of the rest of the services, they run in parallel, but they all complete before the call returns.

That being said, I'm not against the Wait option idea, but I wonder what's the use-case for not waiting? Whatever other services that are passed to the multi service may not complete for an arbitrary amount of time, if the caller passes multiple services, surely it needs to be able to see the results of those at some point after the call. Not waiting can cause a number of issues that I already hinted on, but more specifically:

  • For short-lived commands (maybe an unlikely scenario for this, but still), the command may exit before the other services complete (as they run in unattended goroutines)
  • For web servers (much more likely scenario), a graceful shutdown would not know about and wait for those other services to complete (it waits for connections to become idle, but since the requests calling multi-services may return before the other services complete, the shutdown will not wait for those)
  • If someone doesn't need to wait for some services, it is trivial for them to start a goroutine in the service's implementation and have it not wait. It is impossible to do the reverse (making the services wait if we want to) - without adding to the API (such as the Wait field mentioned above), that is.

@jonathanweibel
Copy link

Hello,

Our team is currently facing a use case where we would like to wait for all services. We host NanoMDM on a serverless cloud service to cut costs. The CPU allocation is only guaranteed during the HTTP Request/Response lifecycle. So, if my understanding of the current MultiService implementation is correct, only the first service is guaranteed to run on our hosting solution.

Is there any chance for the Wait option feature to be merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants