-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
daemon: remove listener closing race #253
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic sounds reasonable to me. Thanks again for the detailed analysis, @flotter!
27f44c0
to
1c272e5
Compare
We use http.Server with a number of custom net.Listener instances. This patch makes some changes to the daemon.Stop(...) function. In order for the server instance to Shutdown() gracefully, the listener termination must be performed by the shutdown process itself. The server shutdown process is thread-safe, and tracks the listeners in use. Thread safety is important because the http.Server has two places where listeners can be terminated: 1. The thread hosting the Serve() method, using a deferred mutex protected closing function for all registered listeners. 2. The Shutdown() method (with mutex protection), which will close listeners async resuling in an error from the listener Accept(), but this special case is handled in Serve() and returns the ErrServerClosed error which means shutdown was done cleanly. However, in daemon.go we also pre-close all the listeners before the call to Shutdown(), which causes two kinds of errors: 1. accept tcp [::]:<port>: use of closed network connection This error is actually masked because of tomb.Kill() (when state is dying) for the termination of the daemon. See the daemon.Start() exit code. 2. close tcp [::]:<port>: use of closed network connection This is the actual race condition symptom that can sometimes be seen in CI logs. Since the listeners are pre-closed, this error is expected. However, errors discovered during deferred closure in the Serve() thread are discarded while errors in Shutdown() as reported. The race is between the two closing paths. If the Shutdown path wins the sync.Once race, we see the error. If the deferred Serve() path wins, we see nothing. This race is only triggered because we manually close the listener first. A standalone simulation of the problem can be seen here: https://gist.github.com/flotter/483d233d38f0961a52f7ef7405887012 The following changes are introduced: - Remove the early manual listener closing code so its properly done by the Shutdown() call.
1c272e5
to
60be0c1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thread hosting the Serve() method, using a deferred mutex protected closing function for all registered listeners.
Where are these Close calls? Both Serve goroutines are fired in the Start method, and there are no Close calls there?
@niemeyer The close calls Fred's referring to are the |
@niemeyer's comments from Mattermost for the record:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just merge this for now. The interlock between the Init of the Daemon, and the Start/Stop pair, don't feel ideal. For example, if we Init the daemon, we already get open listeners, but to get rid of the open listeners, we need to Stop it. Except, if we run Init/Stop without Start, it will crash because d.serve is not there yet. So it does need polishing.
With that said, given the above it also feels like dropping the explicit closing won't make it any worse, since Shutdown is called unconditionally on that sequence. So +1 on just merging this now and polishing it later.
We use http.Server with a number of custom net.Listener instances.
This patch makes some changes to the daemon.Stop(...) function.
In order for the server instance to Shutdown() gracefully, the listener
termination must be performed by the shutdown process itself. The
server shutdown process is thread-safe, and tracks the listeners in
use. Thread safety is important because the http.Server has two places
where listeners can be terminated:
The thread hosting the Serve() method, using a deferred mutex
protected closing function for all registered listeners.
The Shutdown() method (with mutex protection), which will close
listeners async resulting in an error from the listener Accept(),
but this special case is handled in Serve() and returns the
ErrServerClosed error which means shutdown was done cleanly.
However, in daemon.go we also pre-close all the listeners before the
call to Shutdown(), which causes two kinds of errors:
This error is actually masked because of tomb.Kill() (when state is dying)
for the termination of the daemon. See the daemon.Start() exit code.
This is the actual race condition symptom that can sometimes be seen in
CI logs. Since the listeners are pre-closed, this error is expected. However,
errors discovered during deferred closure in the Serve() thread are discarded
while errors in Shutdown() as reported.
The race is between the two closing paths. If the Shutdown path wins the
sync.Once race, we see the error. If the deferred Serve() path wins, we see
nothing.
This race is only triggered because we manually close the listener first.
A stand-alone simulation of the problem can be seen here:
https://gist.github.com/flotter/483d233d38f0961a52f7ef7405887012
The following changes are introduced:
the Shutdown() call.