Labels: Area: xDS (includes everything xDS related, including LB policies used with xDS), Type: Bug
Description
Here is a sequence of events that can cause the ADS stream-level flow control to block forever.
- T-1: The Listener resource is subscribed to and the request has been sent out.
- T0: `recv()` receives the Listener resource from the wire: `resources, url, version, nonce, err := s.recvMessage(stream)`
- T1: `recv()` sets the `pending` bit of the flow control to `true`, invokes the response handler, and passes it the `onDone` callback: `resourceNames, nackErr = s.eventHandler.onResponse(resp, s.fc.onDone)`
- T2: The response handler runs and, as part of handling the update, subscribes to a RouteConfiguration resource. This results in `subscribe()` being called on the ADS stream, which queues the request: `s.requestCh.Put(typ)`
- T3: The response handler invokes the `onDone` callback to release flow control. This writes to `readyCh` to unblock goroutines waiting for flow control. It hasn't yet set the `pending` bit to `false`: `case fc.readyCh <- struct{}{}:`
- T4: Meanwhile, the `send()` goroutine gets to run and processes the request for the RouteConfiguration. It calls `sendNew()` to send this request out: `if err := s.sendNew(stream, typ); err != nil {`
- T5: `sendNew()` checks the `pending` bit of the flow control, which has not yet been set to `false` by the `onDone` callback. It will try to buffer this request, but before that happens, it loses the CPU: `if s.fc.pending.Load() {`
- T6: Meanwhile, `recv()` is in the next iteration of the `for` loop and has been unblocked on the call to `fc.wait()`: `if !s.fc.wait(ctx) {`
- T7: `recv()` attempts to send out any buffered requests by calling `sendBuffered`, but that method does not find any, because `sendNew()` hasn't yet written to the `bufferedRequests` channel: `func (s *adsStreamImpl) sendBuffered(stream clients.Stream) error {`
- T8: `sendNew()` now writes to the `bufferedRequests` channel.
- Anytime after T5: the `onDone` callback sets the `pending` bit to `false`.
But this request (buffered at T8) never gets sent out, because `recv()` is blocked waiting for some response from the management server, but no response is expected because the ADS stream has not requested any new resource.

This will eventually lead to the RDS resource watch timer expiring, and being reported to the watcher as a resource-not-found error.
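
The interleaving can be replayed deterministically in a small single-threaded model, which may help when reasoning about where the window sits. This is only a sketch under stated assumptions: the `stream` type, `drainBuffered`, and `replay` are hypothetical stand-ins invented for illustration, not the actual grpc-go types, and the real code runs these steps on separate goroutines.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stream is a hypothetical, simplified model of the ADS stream state
// involved in the race. It is not the real adsStreamImpl.
type stream struct {
	pending  atomic.Bool // models the flow control pending bit
	buffered chan string // models the bufferedRequests channel
	sent     []string    // requests that actually went out on the wire
}

// drainBuffered models sendBuffered: it sends a buffered request if one
// is present and otherwise returns without doing anything.
func (s *stream) drainBuffered() {
	select {
	case typ := <-s.buffered:
		s.sent = append(s.sent, typ)
	default:
	}
}

// replay executes the steps in the order described in the timeline and
// reports what was sent and how many requests remain stranded in the buffer.
func replay() (sent []string, stranded int) {
	s := &stream{buffered: make(chan string, 1)}

	s.pending.Store(true) // T1: recv marks a response as pending

	// T5: sendNew reads the pending bit, then loses the CPU.
	sawPending := s.pending.Load()

	// T6/T7: recv is unblocked and drains the buffer, which is still empty.
	s.drainBuffered()

	// T8: sendNew resumes and acts on its stale read of the pending bit.
	if sawPending {
		s.buffered <- "RouteConfiguration"
	} else {
		s.sent = append(s.sent, "RouteConfiguration")
	}

	// Anytime after T5: onDone clears the pending bit.
	s.pending.Store(false)

	return s.sent, len(s.buffered)
}

func main() {
	sent, stranded := replay()
	fmt.Printf("sent=%d stranded=%d\n", len(sent), stranded)
}
```

In this model the request ends up in the buffer after the only drain has already happened, and the pending bit is cleared without anything re-checking the buffer. The check at T5 and the write at T8 are not atomic with respect to the release at T3 and the drain at T7, which is the window the timeline describes.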
arjan-bal