
Flaky Test: RingHash_SwitchToLowerPriorityAndThenBack #7783

Open
easwars opened this issue Oct 25, 2024 · 2 comments · May be fixed by #8095
Assignees: arjan-bal
Labels: Area: Resolvers/Balancers (includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities), Type: Bug

Comments


easwars commented Oct 25, 2024

We haven't seen this so far on GitHub Actions, but it seems like this might be a bug in the code rather than in the test.

Full test log here: https://pastebin.com/u2v2JshT

    third_party/golang/grpc/internal/grpctest/grpctest.go:44: Leaked goroutine: goroutine 14686 [select]:
        google3/third_party/golang/grpc/xds/internal/balancer/outlierdetection/outlierdetection.(*outlierDetectionBalancer).run(0xc000f6cf70)
            third_party/golang/grpc/xds/internal/balancer/outlierdetection/balancer.go:687 +0x285
        created by google3/third_party/golang/grpc/xds/internal/balancer/outlierdetection/outlierdetection.bb.Build in goroutine 14678
            third_party/golang/grpc/xds/internal/balancer/outlierdetection/balancer.go:76 +0x991
    third_party/golang/grpc/internal/grpctest/grpctest.go:44: Leaked goroutine: goroutine 14687 [select]:
        google3/third_party/golang/grpc/internal/grpcsync/grpcsync.(*CallbackSerializer).run(0xc00120f3a0, {0x1d08a78, 0xc0017093b0})
            third_party/golang/grpc/internal/grpcsync/callback_serializer.go:88 +0x1e9
        created by google3/third_party/golang/grpc/internal/grpcsync/grpcsync.NewCallbackSerializer in goroutine 14678
            third_party/golang/grpc/internal/grpcsync/callback_serializer.go:52 +0x205
    third_party/golang/grpc/internal/grpctest/grpctest.go:71: Goroutine leak check disabled for future tests

The problem seems to be as follows:

  • There are two priorities, with backends in both
  • We start with priority0; it becomes READY and RPCs succeed
  • The backend then fails, priority0 is no longer usable, and we switch to priority1
  • RPCs succeed to priority1
  • The backend in priority0 comes back up, we switch back to it, and RPCs succeed

The test passes, but there is a leaked goroutine: the child of priority1, which is outlier detection, is not closed, and the child of outlier detection, which is clusterimpl, is not closed either.

I believe the problem arises because when priority1 is closed, it is moved to the idle cache in the balancergroup, but when the priority LB is closed soon after, for some reason, the child in the idle cache is not being cleaned up.

This failure happens about 2 times out of 100k runs, but I feel it is worth investigating.
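For reference, the "Leaked goroutine" output above comes from the grpctest harness, which checks for goroutines still running after each test (grpctest.go:44 in the log). A minimal illustration of the same idea, using the third-party go.uber.org/goleak package purely as a sketch rather than grpc-go's internal leak checker:

    package example_test

    import (
        "testing"

        "go.uber.org/goleak"
    )

    // TestNoGoroutineLeak fails if any non-allowlisted goroutine is still
    // running when the test returns -- the same class of check that flagged
    // the outlierdetection run() and CallbackSerializer goroutines above.
    func TestNoGoroutineLeak(t *testing.T) {
        defer goleak.VerifyNone(t)

        // Exercise the code under test here; any goroutine it starts must be
        // stopped before the test returns.
    }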

@purnesh42H purnesh42H added the Area: Testing Includes tests and testing utilities that we have for unit and e2e tests within our repo. label Oct 29, 2024

purnesh42H commented Jan 10, 2025

https://github.com/grpc/grpc-go/actions/runs/12713316456/job/35440939066?pr=7977

The actual failure in that run is in Test/ServerSideXDS_FileWatcherCerts.

@purnesh42H purnesh42H added Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. and removed Area: Testing Includes tests and testing utilities that we have for unit and e2e tests within our repo. labels Jan 12, 2025
@arjan-bal arjan-bal self-assigned this Feb 17, 2025
arjan-bal commented

The child balancer is orphaned due to a race in the priority balancer, which results in balancergroup.Close() being called at the same time as balancergroup.Remove(). When priority-0-0 becomes ready again, the priority balancer calls balancergroup.Remove() to remove priority-0-1:

    // Caller must hold b.mu.
    func (b *priorityBalancer) stopSubBalancersLowerThanPriority(p int) {
        for i := p + 1; i < len(b.priorities); i++ {
            name := b.priorities[i]
            child, ok := b.children[name]
            if !ok {
                b.logger.Warningf("Priority name %q is not found in list of child policies", name)
                continue
            }
            child.stop()
        }
    }

Inside balancergroup, Remove() locks outgoingMu and adds the sub-balancer to a timeout cache (the deletedBalancerCache); a simplified sketch of the cache's semantics follows the snippet below.

    bg.outgoingMu.Lock()
    sbToRemove, ok := bg.idToBalancerConfig[id]
    if !ok {
        bg.logger.Errorf("Child policy for child %q does not exist in the balancer group", id)
        bg.outgoingMu.Unlock()
        return
    }
    // Unconditionally remove the sub-balancer config from the map.
    delete(bg.idToBalancerConfig, id)
    if !bg.outgoingStarted {
        // Nothing needs to be done here, since we wouldn't have created the
        // sub-balancer.
        bg.outgoingMu.Unlock()
        return
    }
    if bg.deletedBalancerCache != nil {
        if bg.logger.V(2) {
            bg.logger.Infof("Adding child policy for child %q to the balancer cache", id)
            bg.logger.Infof("Number of items remaining in the balancer cache: %d", bg.deletedBalancerCache.Len())
        }
        bg.deletedBalancerCache.Add(id, sbToRemove, func() {
            if bg.logger.V(2) {
                bg.logger.Infof("Removing child policy for child %q from the balancer cache after timeout", id)
                bg.logger.Infof("Number of items remaining in the balancer cache: %d", bg.deletedBalancerCache.Len())
            }
            // A sub-balancer evicted from the timeout cache needs to closed
            // and its subConns need to removed, unconditionally. There is a
            // possibility that a sub-balancer might be removed (thereby
            // moving it to the cache) around the same time that the
            // balancergroup is closed, and by the time we get here the
            // balancergroup might be closed. Check for `outgoingStarted ==
            // true` at that point can lead to a leaked sub-balancer.
            bg.outgoingMu.Lock()
            sbToRemove.stopBalancer()
            bg.outgoingMu.Unlock()
            bg.cleanupSubConns(sbToRemove)
        })
        bg.outgoingMu.Unlock()
        return
    }
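For context, the deletedBalancerCache is a timeout cache: Add schedules the supplied callback to run after a fixed delay, and Clear(true) evicts every entry currently in the cache and runs its callback immediately. Below is a simplified sketch of those semantics; names and details are illustrative, not the actual internal cache implementation. The property that matters for this bug is that Clear only affects entries present when it is called; an entry added afterwards is untouched until its own timer fires.

    package cachesketch

    import (
        "sync"
        "time"
    )

    // timeoutCache is a simplified stand-in for the cache balancergroup uses
    // to hold recently removed sub-balancers before fully closing them.
    type timeoutCache struct {
        mu      sync.Mutex
        timeout time.Duration
        items   map[string]*cacheEntry
    }

    type cacheEntry struct {
        item     any
        timer    *time.Timer
        callback func()
    }

    func newTimeoutCache(d time.Duration) *timeoutCache {
        return &timeoutCache{timeout: d, items: make(map[string]*cacheEntry)}
    }

    // Add stores an entry and arranges for callback to run after the timeout.
    func (c *timeoutCache) Add(key string, item any, callback func()) {
        c.mu.Lock()
        defer c.mu.Unlock()
        e := &cacheEntry{item: item, callback: callback}
        e.timer = time.AfterFunc(c.timeout, func() {
            c.mu.Lock()
            delete(c.items, key)
            c.mu.Unlock()
            callback()
        })
        c.items[key] = e
    }

    // Clear evicts all current entries; if runCallback is true, their
    // callbacks run immediately. Entries added after Clear returns are not
    // affected. (Simplified: the race between a firing timer and Clear is
    // ignored here.)
    func (c *timeoutCache) Clear(runCallback bool) {
        c.mu.Lock()
        entries := c.items
        c.items = make(map[string]*cacheEntry)
        c.mu.Unlock()
        for _, e := range entries {
            e.timer.Stop()
            if runCallback {
                e.callback()
            }
        }
    }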

At the same time, the ClientConn starts shutting down, which closes the balancer tree. When the priority balancer begins shutdown, it closes the balancergroup without holding its mutex (b.mu):

    func (b *priorityBalancer) Close() {
        b.bg.Close()
        b.childBalancerStateUpdate.Close()
        b.mu.Lock()
        defer b.mu.Unlock()

This results in two concurrent calls into balancergroup: one to Close() and one to Remove(). While handling balancergroup.Close(), the timeout cache is cleared without any mutex being held:

    bg.incomingMu.Unlock()
    // Clear(true) runs clear function to close sub-balancers in cache. It
    // must be called out of outgoing mutex.
    if bg.deletedBalancerCache != nil {
        bg.deletedBalancerCache.Clear(true)
    }
    bg.outgoingMu.Lock()

If the cache is cleared before priority-0-1 is added to it, that sub-balancer's close callback does not run, and its outlier detection and clusterimpl children (and their goroutines) are leaked.
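One possible direction for a fix is sketched below as a hedged illustration only; the actual change in #8095 may look quite different. The idea is to make Remove aware that the group has been (or is being) closed and close the sub-balancer immediately instead of parking it in the cache. The `closed` field is hypothetical, `stopBalancer` and `cleanupSubConns` are taken from the snippets above, and details such as the outgoingStarted check are elided; Close would also need to set the flag under outgoingMu before draining the cache.

    // Sketch only: not the actual grpc-go change.
    func (bg *BalancerGroup) Remove(id string) {
        bg.outgoingMu.Lock()
        sbToRemove, ok := bg.idToBalancerConfig[id]
        if !ok {
            bg.outgoingMu.Unlock()
            return
        }
        delete(bg.idToBalancerConfig, id)

        if bg.closed || bg.deletedBalancerCache == nil {
            // The group is (being) closed or caching is disabled: stop the
            // sub-balancer right away so it cannot be orphaned by a Clear()
            // that has already run.
            sbToRemove.stopBalancer()
            bg.outgoingMu.Unlock()
            bg.cleanupSubConns(sbToRemove)
            return
        }

        bg.deletedBalancerCache.Add(id, sbToRemove, func() {
            // Same eviction callback as today: close the sub-balancer when
            // the cache entry expires or the cache is cleared.
            bg.outgoingMu.Lock()
            sbToRemove.stopBalancer()
            bg.outgoingMu.Unlock()
            bg.cleanupSubConns(sbToRemove)
        })
        bg.outgoingMu.Unlock()
    }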

@arjan-bal arjan-bal linked a pull request Feb 17, 2025 that will close this issue