rls: only reset backoff on recovery from TRANSIENT_FAILURE #8720

ulascansenturk · 2025-11-20T19:16:51Z

Fix control channel connectivity monitoring to track TRANSIENT_FAILURE state explicitly. Only reset backoff timers when transitioning from TRANSIENT_FAILURE to READY, not for benign state changes like READY → IDLE → READY.

RELEASE NOTES: N/A

Fixes #8693

codecov · 2025-11-20T19:20:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.22%. Comparing base (50c6321) to head (ed5ab2c).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #8720   +/-   ##
=======================================
  Coverage   83.21%   83.22%           
=======================================
  Files         419      419           
  Lines       32427    32437   +10     
=======================================
+ Hits        26985    26995   +10     
- Misses       4054     4056    +2     
+ Partials     1388     1386    -2

Files with missing lines	Coverage Δ
balancer/rls/control_channel.go	`88.23% <100.00%> (+1.27%)`	⬆️

... and 17 files with indirect coverage changes

eshitachandwani · 2025-11-21T08:30:28Z

balancer/rls/control_channel.go

+				cc.backToReadyFunc()
+				seenTransientFailure = false
+			} else {
+				cc.logger.Infof("Control channel back to READY (no prior failure)")


I think this comment can be improved and made a little more explicit for ease of users.

eshitachandwani · 2025-11-21T08:44:11Z

balancer/rls/control_channel_test.go

+			}
+
+			// Give extra time for any pending callbacks
+			time.Sleep(100 * time.Millisecond)


I don't think adding time.Sleep() to wait for state changes is a good idea: tests can flake if the state transitions sometimes take longer. We should look for better guarantees that the states have transitioned.
cc: @easwars

- Add testOnlyInitialReadyDone channel for proper test synchronization
- Signal when monitoring goroutine processes initial READY state
- Tests wait for this signal instead of using time.Sleep
- All synchronization now uses channels/callbacks - no arbitrary sleeps
- Tests pass consistently with race detector

Addresses review feedback about removing time.Sleep for state transitions.

Copilot

Pull request overview

This PR fixes control channel connectivity monitoring in the RLS balancer to only reset backoff timers when genuinely recovering from a TRANSIENT_FAILURE state, not during benign state changes like READY → IDLE → READY.

Adds explicit tracking of TRANSIENT_FAILURE state with a boolean flag
Updates callback invocation logic to only trigger after recovery from TRANSIENT_FAILURE
Adds comprehensive test coverage for various state transition scenarios

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
balancer/rls/control_channel.go	Implements TRANSIENT_FAILURE tracking with a boolean flag, adds nil check for callback, and includes test synchronization channel
balancer/rls/control_channel_test.go	Adds comprehensive test cases covering different connectivity state transition scenarios

easwars · 2025-11-21T22:43:41Z

balancer/rls/control_channel.go

+	// testOnlyInitialReadyDone is closed when the monitoring goroutine
+	// processes the initial READY state. Only used in tests.
+	testOnlyInitialReadyDone chan struct{}


I know we have test-only code/hooks in some parts of the code, but it would be nice to avoid these.

easwars · 2025-11-21T22:44:31Z

balancer/rls/control_channel.go

FYI: This is the exact text that describes the expected behavior:

The policy will monitor the state of the control plane channel. When the state transitions to TRANSIENT_FAILURE, it will record that transition, and the next time it transitions to READY, the policy will iterate through the cache to reset the backoff timeouts in all cache entries. Specifically, this means that it will reset the backoff state and cancel the pending backoff timer. Note that when cancelling the backoff timer, just like when the backoff timer fires normally, a new picker is returned to the channel, to force it to re-process any wait-for-ready RPCs that may still be queued if we failed them while we were in backoff. However, we should optimize this case by returning only one new picker, regardless of how many backoff timers are cancelled.

Based on the above text, we don't even have to wait for the first time the control channel goes READY. This means that we can simplify the code quite a bit and not even have a control channel connectivity state monitoring goroutine. All we need is the following:

Continue to subscribe to connectivity state changes as we do today when we create the RLS control channel:

grpc-go/balancer/rls/control_channel.go

Line 91 in cdbafd3

ctrlCh.unsubscribe = internal.SubscribeToConnectivityStateChanges.(func(cc *grpc.ClientConn, s grpcsync.Subscriber) func())(ctrlCh.cc, ctrlCh)

In the implementation of the grpcsync.Subscriber interface, we currently push the received connectivity state update on to an unbounded buffer here:

grpc-go/balancer/rls/control_channel.go

Line 104 in cdbafd3

cc.connectivityStateCh.Put(st)

The above buffer is read from the for loop in the monitoring goroutine here:

grpc-go/balancer/rls/control_channel.go

Line 177 in cdbafd3

for s, ok := <-cc.connectivityStateCh.Get(); s != connectivity.Ready; s, ok = <-cc.connectivityStateCh.Get() {

Instead, what we can do is:

func (cc *controlChannel) OnMessage(msg any) {
	st, ok := msg.(connectivity.State)
	if !ok {
		panic(fmt.Sprintf("Unexpected message type %T , wanted connectectivity.State type", msg))
	}

	- If new connectivity state is READY, and we have previously seen TRANSIENT_FAILURE:
	    - set the boolean for tracking previously seen TRANSIENT_FAILURE to false
	    - reset backoffs by invoking the `backToReadyFunc`
	- else if new connectivity state is TRANSIENT_FAILURE
	    - set the boolean for tracking previously seen TRANSIENT_FAILURE to true
	- else
	    - do nothing
}

The above if-elseif-else can also be implemented as a switch and the linter might complain if that is not the case.

ulascansenturk force-pushed the fix/8693-rls-control-channel-state-monitoring branch from e2406a9 to a6fcb7e Compare November 20, 2025 20:42

eshitachandwani added this to the 1.78 Release milestone Nov 21, 2025

eshitachandwani added the Type: Bug label Nov 21, 2025

eshitachandwani reviewed Nov 21, 2025

eshitachandwani requested a review from easwars November 21, 2025 08:44

eshitachandwani assigned easwars and ulascansenturk Nov 21, 2025

ulascansenturk added 4 commits November 21, 2025 12:13

address reviews

ca49ae7

address reviews

c7eb618

Shorter comment

943240d

ulascansenturk force-pushed the fix/8693-rls-control-channel-state-monitoring branch from 135b43d to ed5ab2c Compare November 21, 2025 11:01

ulascansenturk requested a review from eshitachandwani November 21, 2025 11:16

easwars requested a review from Copilot November 21, 2025 18:58

Copilot started reviewing on behalf of easwars November 21, 2025 18:59 View session

Copilot finished reviewing on behalf of easwars November 21, 2025 19:00

Copilot AI reviewed Nov 21, 2025

easwars reviewed Nov 21, 2025

easwars removed their assignment Nov 21, 2025
