Fix hangs in multithreaded fixpoint iteration #1010
Merged
This PR fixes three bugs that could lead to hangs when fixpoint cycles spanned multiple threads:
1. `transfer_lock` and `block_on_heads`: Salsa detects outer cycles by claiming a query's lock and seeing if doing so would result in a cycle. For this to work, it's crucial that a thread participating in a cycle never changes from "blocked" to running unless it's the thread currently driving the cycle forward. However, this could happen before this fix because the thread of an inner cycle kept running after completing `transfer_lock` (to the outer query) until it blocked on one of its cycle heads in `provisional_retry` (`block_on_heads`). This PR fixes this race by removing `provisional_retry` and instead blocking the thread of the inner cycle inside `transfer_lock`. This also results in a nice perf improvement because we no longer need to iterate over cycle heads.
2. `TryClaimCycleHeadsIter` first tested if the head is on the current thread's query stack, and if so, returned the iteration count from the query stack. However, the iteration count on the query stack isn't guaranteed to match the iteration count of the latest memo when the query's ownership was transferred to another thread (and this thread was blocked on the other thread). In that case, the other thread has most likely iterated the query multiple times and we have to refetch the latest memo to get the "newest" iteration count. The fix is easy: remove the local query stack "optimisation" from `TryClaimCycleHeadsIter`.
3. The skip behavior of `thread_id_of_transferred_query` was slightly different from how `transfer_lock` rewrites the transferred locks.

Test Plan
Claude helped me write a new test for 1. I wasn't able to come up with test cases that trigger 2 or 3, but I verified that we no longer see the hangs in ty.
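
For readers unfamiliar with the second bug, the stale iteration count can be illustrated with a toy model. This is only a sketch and not Salsa's actual code: `MemoTable`, `StackEntry`, `iteration_from_stack`, and `iteration_from_memo` are hypothetical names made up for this example. It assumes the query stack stores a snapshot of the iteration count while the latest memo lives in shared state that another thread can advance.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the shared memo storage:
/// query name -> iteration count of the latest memo.
type MemoTable = Arc<Mutex<HashMap<&'static str, u32>>>;

/// Hypothetical stand-in for an entry on a thread's local query stack.
struct StackEntry {
    query: &'static str,
    /// Snapshot taken when the query was pushed; never updated by other threads.
    iteration: u32,
}

/// Models the removed "optimisation": trust the local stack snapshot.
fn iteration_from_stack(stack: &[StackEntry], query: &str) -> Option<u32> {
    stack.iter().find(|e| e.query == query).map(|e| e.iteration)
}

/// Models the fixed behavior: always read the latest memo.
fn iteration_from_memo(memos: &MemoTable, query: &str) -> Option<u32> {
    memos.lock().unwrap().get(query).copied()
}

/// Returns (count seen via the stack, count seen via the memo table).
fn demo() -> (u32, u32) {
    let memos: MemoTable = Arc::new(Mutex::new(HashMap::from([("head", 0)])));
    let stack = vec![StackEntry { query: "head", iteration: 0 }];

    // Another thread takes over the query and iterates it three times.
    let other = {
        let memos = Arc::clone(&memos);
        thread::spawn(move || {
            for _ in 0..3 {
                *memos.lock().unwrap().get_mut("head").unwrap() += 1;
            }
        })
    };
    // Joining stands in for "this thread was blocked on the other thread".
    other.join().unwrap();

    let stale = iteration_from_stack(&stack, "head").unwrap();
    let fresh = iteration_from_memo(&memos, "head").unwrap();
    (stale, fresh)
}

fn main() {
    let (stale, fresh) = demo();
    println!("stack says {stale}, memo says {fresh}");
}
```

In this model the stack snapshot stays at the value recorded before the hand-off, while the memo table reflects the iterations performed by the other thread, which is why reading the memo is the safer choice once ownership can move between threads.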