
PickFirstLeafLoadBalancer may get stuck in Idle on backend address change #12395

@arjan-bal

A bug was found in the new pickfirst balancer in gRPC Go that can leave the balancer stuck in the IDLE state: grpc/grpc-go#8615.

The implementation of the pickfirst LB in Java is similar, so it may suffer from the same issue. The sequence of events that leads to the bug is as follows:

  1. The existing connection breaks. The balancer requests re-resolution and updates the channel state to IDLE with an Idle picker.
  2. An RPC is made, which triggers the balancer to exit idle through the picker. The balancer attempts to reconnect the failed subchannel.
  3. The resolver produces a new endpoint list that no longer contains the endpoint used by the existing subchannel, so pickfirst removes that subchannel. Because the balancer has not yet updated the channel state to CONNECTING, pickfirst believes it is still IDLE and does not start connecting to the new endpoints.
  4. Subsequent RPCs hit the Idle picker, but it is a no-op: the picker triggers the balancer's ExitIdle method only once, and that call was already used in step 2.
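
The steps above can be replayed against a toy state machine to make the ordering easier to follow. This is only a sketch under stated assumptions: `ToyPickFirst`, `OneShotIdlePicker`, `replay`, and the endpoint strings are hypothetical names invented for illustration. None of it is the real gRPC Java API or the actual `PickFirstLeafLoadBalancer` code; it only models the behaviour described in this issue.

```java
import java.util.List;

/** Toy model of the event sequence in this issue. NOT the real gRPC classes. */
public class PickFirstIdleModel {
    enum State { IDLE, CONNECTING, READY }

    /** Hypothetical stand-in for the pickfirst balancer. */
    static final class ToyPickFirst {
        State reportedState = State.READY;  // last state published to the channel
        String subchannelEndpoint;          // endpoint of the current subchannel, or null
        boolean connectInFlight = false;    // true while a connection attempt is running

        ToyPickFirst(String initialEndpoint) {
            this.subchannelEndpoint = initialEndpoint;
        }

        /** Step 1: the connection breaks and the balancer reports IDLE. */
        void onConnectionBroken() {
            connectInFlight = false;
            reportedState = State.IDLE;
        }

        /** Step 2: reconnect the old subchannel; CONNECTING is not published yet. */
        void exitIdle() {
            if (subchannelEndpoint != null) {
                connectInFlight = true;
            }
        }

        /** Step 3: a new endpoint list arrives from the resolver. */
        void onResolvedAddresses(List<String> endpoints) {
            if (subchannelEndpoint != null && !endpoints.contains(subchannelEndpoint)) {
                // The old endpoint is gone, so its subchannel (and its attempt) is dropped.
                subchannelEndpoint = null;
                connectInFlight = false;
            }
            if (reportedState == State.IDLE) {
                // Assumed problematic guard: the balancer thinks it is still idle
                // and waits for the picker to call exitIdle() again.
                return;
            }
            subchannelEndpoint = endpoints.get(0);
            connectInFlight = true;
            reportedState = State.CONNECTING;
        }
    }

    /** Hypothetical Idle picker that requests exitIdle() only on its first pick. */
    static final class OneShotIdlePicker {
        private final ToyPickFirst balancer;
        private boolean exitIdleRequested = false;

        OneShotIdlePicker(ToyPickFirst balancer) {
            this.balancer = balancer;
        }

        void pick() {
            if (!exitIdleRequested) {
                exitIdleRequested = true;
                balancer.exitIdle();
            }
        }
    }

    /** Replays steps 1 to 4 and returns the balancer in its final state. */
    static ToyPickFirst replay() {
        ToyPickFirst lb = new ToyPickFirst("10.0.0.1:443");
        lb.onConnectionBroken();                          // step 1
        OneShotIdlePicker picker = new OneShotIdlePicker(lb);
        picker.pick();                                    // step 2
        lb.onResolvedAddresses(List.of("10.0.0.2:443"));  // step 3
        picker.pick();                                    // step 4: no-op
        return lb;
    }

    public static void main(String[] args) {
        ToyPickFirst lb = replay();
        // prints "reported=IDLE connectInFlight=false"
        System.out.println("reported=" + lb.reportedState
            + " connectInFlight=" + lb.connectInFlight);
    }
}
```

In this model the balancer ends in IDLE with no subchannel and no connection attempt in flight, and the only picker that could wake it up has already used its single `exitIdle()` call.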
