@easwars
Contributor

@easwars easwars commented Nov 14, 2025

Fixes #7686

Current Behavior

  • When client exits IDLE and creates the name resolver, it stays in IDLE until the connectivity state is set by the LB policy.
  • When exiting IDLE mode (because of Connect being called or because of an RPC), if name resolver creation fails, we stay in IDLE.

New Behavior

  • When the client exits IDLE and creates the name resolver, it moves to CONNECTING. Moving forward, the connectivity state will be set by the LB policy.
  • When exiting IDLE mode (because of Connect being called or because of an RPC), we have already moved to CONNECTING (because of the previous bullet point). If name resolver creation fails, we move to TRANSIENT_FAILURE, start the idle timer, and move back to IDLE when the timer fires.
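
The transitions described above can be sketched as a toy state machine. This is only an illustration under stated assumptions: `State`, `channel`, `exitIdle`, `onIdleTimeout`, and `errBuildFailed` are hypothetical names made up for this sketch and are not grpc-go's actual types or functions.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical stand-in for grpc-go's connectivity.State.
type State string

const (
	Idle             State = "IDLE"
	Connecting       State = "CONNECTING"
	TransientFailure State = "TRANSIENT_FAILURE"
)

// errBuildFailed is a hypothetical error representing a resolver build failure.
var errBuildFailed = errors.New("resolver build failed")

// channel is a toy model of a client channel; it is not grpc-go's ClientConn.
type channel struct {
	state   State
	history []State // every state the channel has moved to, in order
}

func (c *channel) setState(s State) {
	c.state = s
	c.history = append(c.history, s)
}

// exitIdle models leaving IDLE: move to CONNECTING first, then build the
// resolver. If the build fails, move to TRANSIENT_FAILURE and return the error.
func (c *channel) exitIdle(buildResolver func() error) error {
	c.setState(Connecting)
	if err := buildResolver(); err != nil {
		c.setState(TransientFailure)
		return err
	}
	return nil
}

// onIdleTimeout models the idle timer firing: the channel returns to IDLE,
// even when its current state is TRANSIENT_FAILURE.
func (c *channel) onIdleTimeout() {
	c.setState(Idle)
}

func main() {
	c := &channel{state: Idle}
	err := c.exitIdle(func() error { return errBuildFailed })
	c.onIdleTimeout()
	fmt.Println(err, c.history)
}
```

In this model a failed resolver build produces the sequence CONNECTING, TRANSIENT_FAILURE, IDLE, which matches the order of the release notes below.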

RELEASE NOTES:

  • client: Change connectivity state to CONNECTING when creating the name resolver (as part of exiting IDLE).
  • client: Change connectivity state to TRANSIENT_FAILURE if name resolver creation fails.
  • client: Change connectivity state to IDLE after idle timeout expires (also when current state is TRANSIENT_FAILURE).

@easwars easwars requested a review from dfawley November 14, 2025 21:48
@easwars easwars added the Type: Bug and Area: Client labels Nov 14, 2025
@easwars easwars added this to the 1.78 Release milestone Nov 14, 2025
@easwars easwars requested a review from arjan-bal November 14, 2025 21:48
@codecov

codecov bot commented Nov 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.20%. Comparing base (112ec12) to head (bd0ee2b).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8710      +/-   ##
==========================================
- Coverage   83.28%   83.20%   -0.09%     
==========================================
  Files         416      419       +3     
  Lines       32267    32431     +164     
==========================================
+ Hits        26874    26983     +109     
- Misses       4019     4061      +42     
- Partials     1374     1387      +13     
Files with missing lines Coverage Δ
clientconn.go 90.54% <100.00%> (+0.40%) ⬆️
internal/idle/idle.go 89.28% <100.00%> (+0.12%) ⬆️
resolver_wrapper.go 92.45% <ø> (ø)

... and 27 files with indirect coverage changes

dial_test.go Outdated

func (s stringerVal) String() string { return s.s }

const errResolverBuildercheme = "test-resolver-build-failure"
Member

Typo

Contributor Author

Done.

Comment on lines 84 to 92
// https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md
// defines CONNECTING as follows:
// - The channel is trying to establish a connection and is waiting to
// make progress on one of the steps involved in name resolution, TCP
// connection establishment or TLS handshake. This may be used as the
// initial state for channels upon creation.
//
// We are starting the name resolver here as part of exiting IDLE, so
// transitioning to CONNECTING is the right thing to do.
Member

IMO comments should be short and to the point.

Short comments make the code take up less space, which makes it easier to read and understand. Long comments make long functions extremely long and not fit on the page.

Honestly, I think a comment for this action isn't even necessary. But if you think we need one, this could be:

// Set state to CONNECTING before building the name resolver
// so the channel does not remain in IDLE.

Contributor Author

Done.

Comment on lines 740 to 742
if state := cc.GetState(); state != connectivity.Idle {
t.Fatalf("Expected initial state to be IDLE, got %v", state)
}
Member

The AwaitState above already tested this IIUC

Contributor Author

Done.

// Ensure that the client is in IDLE before connecting.
ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
defer cancel()
testutils.AwaitState(ctx, t, cc, connectivity.Idle)
Member

This doesn't need an Await right? It should just check the current state, and never wait for changes, as we know it starts idle.

Contributor Author

That's true. Moved the check for the current state to here, and got rid of the Await.

for _, wantState := range wantStates {
waitForState(ctx, t, stateCh, wantState)
if wantState == connectivity.Idle {
tt.exitIdleFunc(ctx, cc)
Member

Can we make this test the actual RPC error when we use an RPC to exit idle?

Contributor Author

Done. I changed the test a little to make that happen.

But as part of doing that, I realized a thing or two:

  • The RPC does not fail with a status error, because idle.Manager.ExitIdleMode, which is called when an RPC has to kick the channel out of IDLE, does not embed the error returned from ClientConn.ExitIdleMode. If we make it embed the error, we will have to return a status error from the latter, which is doable. But that raises the following questions:
    • What code do we return? I'm torn between Unavailable and Internal, and leaning towards the latter.
    • This would also make Dial fail with a status error, which I find a little odd.

Thoughts?
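
To make the "embed the error" idea concrete, here is a minimal sketch using only the standard library. It is not the grpc-go implementation: `exitIdleMode`, `onCallBegin`, and `errResolverBuild` are hypothetical names, and plain error wrapping stands in for the status error discussed above, whose code is still undecided in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// errResolverBuild is a hypothetical error standing in for a resolver
// build failure.
var errResolverBuild = errors.New("failed to build resolver")

// exitIdleMode is a hypothetical stand-in for the channel's exit-idle step.
func exitIdleMode(fail bool) error {
	if fail {
		return errResolverBuild
	}
	return nil
}

// onCallBegin models the idle manager being invoked at the start of an RPC.
// Wrapping with %w embeds the underlying cause so the caller can inspect it.
func onCallBegin(fail bool) error {
	if err := exitIdleMode(fail); err != nil {
		return fmt.Errorf("failed to exit idle mode: %w", err)
	}
	return nil
}

func main() {
	err := onCallBegin(true)
	fmt.Println(err)
	// errors.Is sees through the wrapper to the embedded cause.
	fmt.Println(errors.Is(err, errResolverBuild))
}
```

If the wrapper dropped the cause instead of embedding it, the caller would only see a generic failure, which is the situation described in the bullet above.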

Member

Maybe Unavailable and then it's treated the same as if the resolver did a ReportError immediately instead of failing to build? But probably we should see what the C++/Java lame/failing channels do before deciding, since that seems like the most equivalent scenario in languages where the resolver can't fail to build -- it just doesn't exist.

//
// We are starting the name resolver here as part of exiting IDLE, so
// transitioning to CONNECTING is the right thing to do.
ccr.cc.csMgr.updateState(connectivity.Connecting)
Member

Could this (and the error handling below) go into exitIdleMode instead? It might make more sense to be in the channel directly, rather than in this sub-component.

Contributor Author

Done.

Labels

Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. Type: Bug

Development

Successfully merging this pull request may close these issues.

Add a test for the initial channel state while waiting for the first name resolver update

3 participants