fix: improve reconnection reliability after process reloads (#32707)
This commit includes a 7-character fix in lib/service/connect.go to call connector.Close() instead of connector.Client.Close() when a new client fails to ping the auth server. connector.Close() correctly avoids closing the client if it is a shared copy of the Instance client. The call to connector.Client.Close() was causing intermittent problems where reconnectToAuthService could get stuck repeatedly trying to use the same client that was just closed. This appears to be fixed now that the Instance client is not being improperly closed by other components.

I discovered this issue because it manifested itself in flaky failures of TestHSMMigrate, where logs indicated that the Instance client was being repeatedly reused but the connection was never successful:

```
{"caller":"service/connect.go:1057","component":"proc:18","level":"info","message":"Reusing Instance client for Proxy. additionalSystemRoles=[Proxy]","pid":"34558.18","timestamp":"2023-09-27T21:30:05Z"}
{"caller":"service/connect.go:166","component":"proc:18","level":"debug","message":"Connected client: Identity(Proxy, cert(c90e905c-76e7-4c68-803b-ba364167ec6f.testcluster issued by testcluster:173887050308815087166604899475019267945),trust root(testcluster:322819974523436048061473591931335284057),trust root(testcluster:173887050308815087166604899475019267945),trust root(testcluster:135083743987735629230336583041497316143))","pid":"34558.18","timestamp":"2023-09-27T21:30:05Z"}
{"caller":"service/connect.go:98","component":"proc:18","level":"debug","message":"Connected client Proxy failed to execute test call: rpc error: code = Canceled desc = grpc: the client connection is closing. Node or proxy credentials are out of sync.","pid":"34558.18","timestamp":"2023-09-27T21:30:05Z"}
time="2023-09-27T21:30:13Z" level=warning msg="connection problem: readfrom tcp 172.18.0.2:38114->172.18.0.2:41065: use of closed network connection *net.OpError" dest="172.18.0.2:41065" source="172.18.0.2:37496" trace.component=loadbalancer trace.fields="map[listen:8ce2fe8a89f0:0]"
time="2023-09-27T21:30:13Z" level=warning msg="Failed to forward connection: readfrom tcp 172.18.0.2:38114->172.18.0.2:41065: use of closed network connection." trace.component=loadbalancer trace.fields="map[listen:8ce2fe8a89f0:0]"
time="2023-09-27T21:30:17Z" level=warning msg="Failed to create inventory control stream: rpc error: code = Canceled desc = grpc: the client connection is closing."
{"caller":"service/connect.go:124","component":"proc:18","level":"debug","message":"Retrying connection to auth server after waiting 41.323026451s.","pid":"34558.18","timestamp":"2023-09-27T21:30:46Z"}
{"caller":"service/connect.go:189","component":"proc:18","level":"debug","message":"Connected state: rotating servers (mode: manual, started: Sep 27 2023 21:29:24 UTC, ending: Sep 29 2023 03:29:24 UTC).","pid":"34558.18","timestamp":"2023-09-27T21:30:46Z"}
{"caller":"service/connect.go:1057","component":"proc:18","level":"info","message":"Reusing Instance client for Proxy. additionalSystemRoles=[Proxy]","pid":"34558.18","timestamp":"2023-09-27T21:30:46Z"}
...repeating...
```

The HSM tests have become flaky in the past when reload/reconnect bugs like this have been introduced, but they are long tests that are a bit tricky to run locally, and issues like this one can be difficult to diagnose. To try to improve our chances of catching these issues in the future, I've written a new test that starts up an Auth and Proxy process and repeatedly reloads both of them, asserting that the reload is always successful in a reasonable amount of time.
The new test has caught the bug every time I have run it locally, usually in ~4 of the 8 parallel invocations. I have not seen any failures with the fix applied. The entire test completes in ~12 seconds on my local machine.
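For readers unfamiliar with the distinction between connector.Close() and connector.Client.Close(), here is a minimal, self-contained sketch of the idea. The `client` and `connector` types and the `shared` field are hypothetical stand-ins written for this example; they are not the actual definitions in lib/service/connect.go.

```go
package main

import "fmt"

// client is a hypothetical stand-in for an auth client.
type client struct {
	closed bool
}

// Close marks the underlying connection as closed.
func (c *client) Close() error {
	c.closed = true
	return nil
}

// connector is a hypothetical wrapper around a client. The shared flag
// (an assumption for this sketch) records that Client is a copy of the
// Instance client owned by someone else.
type connector struct {
	Client *client
	shared bool
}

// Close closes the client only when this connector owns it.
func (c *connector) Close() error {
	if c.Client == nil || c.shared {
		return nil
	}
	return c.Client.Close()
}

func main() {
	instance := &client{}
	proxy := &connector{Client: instance, shared: true}

	// Going through the connector leaves the shared client usable.
	_ = proxy.Close()
	fmt.Println("closed after connector.Close():", instance.closed)

	// Closing the client directly breaks every other user of it.
	_ = proxy.Client.Close()
	fmt.Println("closed after connector.Client.Close():", instance.closed)
}
```

In this sketch, calling Close on the wrapper is the only safe path when the client may be shared, which mirrors the reasoning behind the fix.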