
Flaky Test: RingHash_SwitchToLowerPriorityAndThenBack #7783

Open
easwars opened this issue Oct 25, 2024 · 2 comments · May be fixed by #8095
Assignees: arjan-bal
Labels: Area: Resolvers/Balancers (includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities), Type: Bug

Comments


easwars commented Oct 25, 2024

We haven't seen this so far on GitHub Actions, but it seems like this might be a bug in the code rather than in the test.

Full test log here: https://pastebin.com/u2v2JshT

    third_party/golang/grpc/internal/grpctest/grpctest.go:44: Leaked goroutine: goroutine 14686 [select]:
        google3/third_party/golang/grpc/xds/internal/balancer/outlierdetection/outlierdetection.(*outlierDetectionBalancer).run(0xc000f6cf70)
            third_party/golang/grpc/xds/internal/balancer/outlierdetection/balancer.go:687 +0x285
        created by google3/third_party/golang/grpc/xds/internal/balancer/outlierdetection/outlierdetection.bb.Build in goroutine 14678
            third_party/golang/grpc/xds/internal/balancer/outlierdetection/balancer.go:76 +0x991
    third_party/golang/grpc/internal/grpctest/grpctest.go:44: Leaked goroutine: goroutine 14687 [select]:
        google3/third_party/golang/grpc/internal/grpcsync/grpcsync.(*CallbackSerializer).run(0xc00120f3a0, {0x1d08a78, 0xc0017093b0})
            third_party/golang/grpc/internal/grpcsync/callback_serializer.go:88 +0x1e9
        created by google3/third_party/golang/grpc/internal/grpcsync/grpcsync.NewCallbackSerializer in goroutine 14678
            third_party/golang/grpc/internal/grpcsync/callback_serializer.go:52 +0x205
    third_party/golang/grpc/internal/grpctest/grpctest.go:71: Goroutine leak check disabled for future tests

The problem seems to be as follows:

  • There are two priorities, with backends in both
  • We start with priority0; it becomes READY and RPCs succeed
  • The backend then fails, priority0 is no longer usable, and we switch to priority1
  • RPCs succeed to priority1
  • The backend in priority0 comes back up, we switch back to it, and RPCs succeed

The test passes, but there is a leaked goroutine: the child of priority1, which is outlier detection, is not closed, and the child of outlier detection, which is clusterimpl, is not closed either.

I believe the problem arises because when priority1 is closed, it is moved to the idle cache in the balancergroup, but when the priority LB is closed soon after, for some reason, the child in the idle cache is not being cleaned up.

This failure happens about 2 times out of 100k runs, but I feel it is worth investigating.
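For reference, the "Leaked goroutine" output above comes from the grpctest harness, which checks for goroutines still running after each test (grpctest.go:44 in the log). A minimal illustration of the same idea, using the third-party go.uber.org/goleak package purely as a sketch rather than grpc-go's internal leak checker:

    package example_test

    import (
        "testing"

        "go.uber.org/goleak"
    )

    // TestNoGoroutineLeak fails if any non-allowlisted goroutine is still
    // running when the test returns -- the same class of check that flagged
    // the outlierdetection run() and CallbackSerializer goroutines above.
    func TestNoGoroutineLeak(t *testing.T) {
        defer goleak.VerifyNone(t)

        // Exercise the code under test here; any goroutine it starts must be
        // stopped before the test returns.
    }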

@purnesh42H purnesh42H added the Area: Testing Includes tests and testing utilities that we have for unit and e2e tests within our repo. label Oct 29, 2024

purnesh42H commented Jan 10, 2025

https://github.com/grpc/grpc-go/actions/runs/12713316456/job/35440939066?pr=7977

The actual failure in that run is in Test/ServerSideXDS_FileWatcherCerts.

@purnesh42H purnesh42H added Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. and removed Area: Testing Includes tests and testing utilities that we have for unit and e2e tests within our repo. labels Jan 12, 2025
@arjan-bal arjan-bal self-assigned this Feb 17, 2025
arjan-bal commented

The child balancer is orphaned due to a race in the priority balancer, which results in balancergroup.Close() being called at the same time as balancergroup.Remove(). When priority-0-0 becomes ready again, the priority balancer calls balancergroup.Remove() to remove priority-0-1:

    // Caller must hold b.mu.
    func (b *priorityBalancer) stopSubBalancersLowerThanPriority(p int) {
        for i := p + 1; i < len(b.priorities); i++ {
            name := b.priorities[i]
            child, ok := b.children[name]
            if !ok {
                b.logger.Warningf("Priority name %q is not found in list of child policies", name)
                continue
            }
            child.stop()
        }
    }

Inside balancergroup, Remove() locks outgoingMu and adds the sub-balancer to a timeout cache (the deletedBalancerCache); a simplified sketch of the cache's semantics follows the snippet below.

    bg.outgoingMu.Lock()
    sbToRemove, ok := bg.idToBalancerConfig[id]
    if !ok {
        bg.logger.Errorf("Child policy for child %q does not exist in the balancer group", id)
        bg.outgoingMu.Unlock()
        return
    }
    // Unconditionally remove the sub-balancer config from the map.
    delete(bg.idToBalancerConfig, id)
    if !bg.outgoingStarted {
        // Nothing needs to be done here, since we wouldn't have created the
        // sub-balancer.
        bg.outgoingMu.Unlock()
        return
    }
    if bg.deletedBalancerCache != nil {
        if bg.logger.V(2) {
            bg.logger.Infof("Adding child policy for child %q to the balancer cache", id)
            bg.logger.Infof("Number of items remaining in the balancer cache: %d", bg.deletedBalancerCache.Len())
        }
        bg.deletedBalancerCache.Add(id, sbToRemove, func() {
            if bg.logger.V(2) {
                bg.logger.Infof("Removing child policy for child %q from the balancer cache after timeout", id)
                bg.logger.Infof("Number of items remaining in the balancer cache: %d", bg.deletedBalancerCache.Len())
            }
            // A sub-balancer evicted from the timeout cache needs to closed
            // and its subConns need to removed, unconditionally. There is a
            // possibility that a sub-balancer might be removed (thereby
            // moving it to the cache) around the same time that the
            // balancergroup is closed, and by the time we get here the
            // balancergroup might be closed. Check for `outgoingStarted ==
            // true` at that point can lead to a leaked sub-balancer.
            bg.outgoingMu.Lock()
            sbToRemove.stopBalancer()
            bg.outgoingMu.Unlock()
            bg.cleanupSubConns(sbToRemove)
        })
        bg.outgoingMu.Unlock()
        return
    }
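For context, the deletedBalancerCache is a timeout cache: Add schedules the supplied callback to run after a fixed delay, and Clear(true) evicts every entry currently in the cache and runs its callback immediately. Below is a simplified sketch of those semantics; names and details are illustrative, not the actual internal cache implementation. The property that matters for this bug is that Clear only affects entries present when it is called; an entry added afterwards is untouched until its own timer fires.

    package cachesketch

    import (
        "sync"
        "time"
    )

    // timeoutCache is a simplified stand-in for the cache balancergroup uses
    // to hold recently removed sub-balancers before fully closing them.
    type timeoutCache struct {
        mu      sync.Mutex
        timeout time.Duration
        items   map[string]*cacheEntry
    }

    type cacheEntry struct {
        item     any
        timer    *time.Timer
        callback func()
    }

    func newTimeoutCache(d time.Duration) *timeoutCache {
        return &timeoutCache{timeout: d, items: make(map[string]*cacheEntry)}
    }

    // Add stores an entry and arranges for callback to run after the timeout.
    func (c *timeoutCache) Add(key string, item any, callback func()) {
        c.mu.Lock()
        defer c.mu.Unlock()
        e := &cacheEntry{item: item, callback: callback}
        e.timer = time.AfterFunc(c.timeout, func() {
            c.mu.Lock()
            delete(c.items, key)
            c.mu.Unlock()
            callback()
        })
        c.items[key] = e
    }

    // Clear evicts all current entries; if runCallback is true, their
    // callbacks run immediately. Entries added after Clear returns are not
    // affected. (Simplified: the race between a firing timer and Clear is
    // ignored here.)
    func (c *timeoutCache) Clear(runCallback bool) {
        c.mu.Lock()
        entries := c.items
        c.items = make(map[string]*cacheEntry)
        c.mu.Unlock()
        for _, e := range entries {
            e.timer.Stop()
            if runCallback {
                e.callback()
            }
        }
    }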

At the same time, the ClientConn starts shutting down, which closes the balancer tree. When the priority balancer begins shutdown, it closes the balancergroup without holding its mutex (b.mu):

    func (b *priorityBalancer) Close() {
        b.bg.Close()
        b.childBalancerStateUpdate.Close()
        b.mu.Lock()
        defer b.mu.Unlock()

This results in two concurrent calls into balancergroup: one to Close() and one to Remove(). While handling balancergroup.Close(), the timeout cache is cleared without any mutex being held:

    bg.incomingMu.Unlock()
    // Clear(true) runs clear function to close sub-balancers in cache. It
    // must be called out of outgoing mutex.
    if bg.deletedBalancerCache != nil {
        bg.deletedBalancerCache.Clear(true)
    }
    bg.outgoingMu.Lock()

If the cache is cleared before priority-0-1 is added to it, that sub-balancer's close callback does not run, and its outlier detection and clusterimpl children (and their goroutines) are leaked.
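One possible direction for a fix is sketched below as a hedged illustration only; the actual change in #8095 may look quite different. The idea is to make Remove aware that the group has been (or is being) closed and close the sub-balancer immediately instead of parking it in the cache. The `closed` field is hypothetical, `stopBalancer` and `cleanupSubConns` are taken from the snippets above, and details such as the outgoingStarted check are elided; Close would also need to set the flag under outgoingMu before draining the cache.

    // Sketch only: not the actual grpc-go change.
    func (bg *BalancerGroup) Remove(id string) {
        bg.outgoingMu.Lock()
        sbToRemove, ok := bg.idToBalancerConfig[id]
        if !ok {
            bg.outgoingMu.Unlock()
            return
        }
        delete(bg.idToBalancerConfig, id)

        if bg.closed || bg.deletedBalancerCache == nil {
            // The group is (being) closed or caching is disabled: stop the
            // sub-balancer right away so it cannot be orphaned by a Clear()
            // that has already run.
            sbToRemove.stopBalancer()
            bg.outgoingMu.Unlock()
            bg.cleanupSubConns(sbToRemove)
            return
        }

        bg.deletedBalancerCache.Add(id, sbToRemove, func() {
            // Same eviction callback as today: close the sub-balancer when
            // the cache entry expires or the cache is cleared.
            bg.outgoingMu.Lock()
            sbToRemove.stopBalancer()
            bg.outgoingMu.Unlock()
            bg.cleanupSubConns(sbToRemove)
        })
        bg.outgoingMu.Unlock()
    }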

@arjan-bal arjan-bal linked a pull request Feb 17, 2025 that will close this issue