Skip to content

Conversation

@MichaReiser
Copy link
Contributor

@MichaReiser MichaReiser commented Oct 19, 2025

This PR fixes three bugs that could lead to hangs when fixpoint cycles spanned multiple threads:

  1. Race between transfer_lock and block_on_heads: Salsa detects outer cycles by claiming a query's lock and seeing if doing so would result in a cycle. For this to work, it's crucial that a thread participating in a cycle never changes from "blocked" to running unless it's the thread currently driving the cycle forward. However, this could happen before this fix because the thread of an inner cycle kept running after completing transfer_lock (to the outer query) until it blocked on one of its cycle heads in provisional_retry (block_on_heads). This PR fixes this race by removing provisional_retry and instead blocking the thread of the inner cycle inside transfer_lock. This also results in a nice perf improvement because we no longer need to iterate over cycle heads.
  2. Before this PR, TryClaimCycleHeadsIter first tested if the head is on the current thread's query stack, and if so, returned the iteration count from the query stack. However, the iteration count on the query stack isn't guaranteed to match the iteration count of the latest memo when the query's ownership was transferred to another thread (and this thread was blocked on the other thread). In that case, the other thread has most likely iterated the query multiple times and we have to refetch the latest memo to get the "newest" iteration count. The fix is easy, remove the local query stack "optimisation" from TryClaimCycleHeadsIter.
  3. This was subtle, but thread_id_of_transferred_query skip behavior was slightly different from how transfer_lock rewrites the transferred locks.

Test Plan

Claude helped me write a new test for 1. I wasn't able to come up with test cases that trigger 2 or 3, but I verified that we no longer see the hangs in ty.

same query blocking on the outer head in `provisional_retry`
@netlify
Copy link

netlify bot commented Oct 19, 2025

Deploy Preview for salsa-rs canceled.

Name Link
🔨 Latest commit 48e53cd
🔍 Latest deploy log https://app.netlify.com/projects/salsa-rs/deploys/68f5d4d6fec6c400089c632f

@codspeed-hq
Copy link

codspeed-hq bot commented Oct 19, 2025

CodSpeed Performance Report

Merging #1010 will degrade performances by 4.3%

Comparing MichaReiser:fixpoint-race (48e53cd) with master (9cfe41c)

Summary

⚡ 1 improvement
❌ 1 (👁 1) regression
✅ 11 untouched

Benchmarks breakdown

Benchmark BASE HEAD Change
👁 amortized[SupertypeInput] 2.8 µs 2.9 µs -4.3%
converge_diverge_nested 120.5 µs 111.6 µs +8%

@MichaReiser MichaReiser added the bug Something isn't working label Oct 19, 2025
@MichaReiser MichaReiser marked this pull request as ready for review October 19, 2025 08:50
@MichaReiser MichaReiser added this pull request to the merge queue Oct 23, 2025
Merged via the queue into salsa-rs:master with commit 16d51d6 Oct 23, 2025
10 of 12 checks passed
@MichaReiser MichaReiser deleted the fixpoint-race branch October 23, 2025 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants