even faster unlock in contention #462
Conversation
This is an alternative, more aggressive implementation of the idea in Amanieu#461. Compared to Amanieu#461, this PR:
- maintains the parked bit on the waiter side, so the waker doesn't have to perform two atomic operations;
- resets all lock state back to 0 on unlock, which makes the fast lock path more likely to succeed under high contention (see the sketch below);
- sets PARKED_BIT even when a waiter is prevented from sleeping, so that more threads can be woken up during contention to compete for progress.

Signed-off-by: Jay <[email protected]>
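As a rough illustration of the reset-to-0 unlock described above, here is a minimal standalone sketch, not the actual parking_lot code (names and structure are simplified assumptions): unlock swaps the entire state word back to 0 so a concurrent fast-path lock can succeed immediately, and only takes the slow wake-up path when the parked bit was observed.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const LOCKED_BIT: u8 = 0b01;
const PARKED_BIT: u8 = 0b10;

struct RawMutex {
    state: AtomicU8,
}

impl RawMutex {
    fn try_lock_fast(&self) -> bool {
        // Fast path: only succeeds when the state is exactly 0 (unlocked and
        // no parked bit), which is why resetting the whole word on unlock
        // helps under contention.
        self.state
            .compare_exchange_weak(0, LOCKED_BIT, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    fn unlock(&self) {
        // Reset all state bits to 0 in a single swap.
        let prev = self.state.swap(0, Ordering::Release);
        if prev & PARKED_BIT != 0 {
            // There may be parked threads; wake one of them up.
            // (The real crate does this through parking_lot_core; omitted here.)
            self.unlock_slow();
        }
    }

    #[cold]
    fn unlock_slow(&self) {
        // Placeholder for the wake-up logic.
    }
}
```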
Amanieu left a comment:
Thanks for this work, I really appreciate it!
I've only reviewed the RawMutex part so far, but it looks very promising.
```diff
  // If we are using a fair unlock then we should keep the
  // mutex locked and hand it off to the unparked thread.
- if result.unparked_threads != 0 && (force_fair || result.be_fair) {
+ if result.unparked_threads != 0 && force_fair {
```
The logic here is very different in the unlock_fair case so it would be better to have a separate unlock_fair_slow method.
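A hedged sketch of what that split could look like; the struct and method bodies below are placeholder assumptions, not the real implementation:

```rust
use std::sync::atomic::AtomicU8;

struct RawMutex {
    state: AtomicU8,
}

impl RawMutex {
    #[cold]
    fn unlock_slow(&self) {
        // Normal unlock: clear the lock state and wake one waiter, which
        // then re-acquires the lock through the usual fast or slow path.
    }

    #[cold]
    fn unlock_fair_slow(&self) {
        // Fair unlock: keep the mutex locked and hand it off directly to
        // the unparked thread (TOKEN_HANDOFF), so it never has to compete.
    }
}
```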
```diff
  #[cold]
- fn lock_slow(&self, timeout: Option<Instant>) -> bool {
+ fn lock_slow(&self, timeout: Option<Instant>, in_contention: bool) -> bool {
```
```diff
- fn lock_slow(&self, timeout: Option<Instant>, in_contention: bool) -> bool {
+ fn lock_slow(&self, timeout: Option<Instant>, set_parked_bit: bool) -> bool {
```
I think set_parked_bit is a clearer name for this.
```rust
    state | LOCKED_BIT | extra_flags,
    Ordering::Acquire,
    Ordering::Relaxed,
) {
```
The spin loop on line 233 should be disabled if set_parked_bit since there are actually parked threads even though the bit isn't set.
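Something like the following hypothetical helper (not code from this PR) captures the condition being asked for: spinning only happens when there is no evidence of parked threads, either from the state word or from the caller.

```rust
const PARKED_BIT: u8 = 0b10;

// Hypothetical helper: spin only while the parked bit is clear, the caller
// hasn't told us there are waiters, and we haven't spun too long already.
fn should_spin(set_parked_bit: bool, state: u8, spin_iterations: u32) -> bool {
    !set_parked_bit && (state & PARKED_BIT == 0) && spin_iterations < 10
}
```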
```diff
- } else {
-     mutex.lock();
+ match result {
+     ParkResult::Unparked(TOKEN_HANDOFF) => unreachable!("can't be handed off"),
```
TOKEN_HANDOFF is actually reachable if we are requeued onto a mutex and then another unlocks that mutex with unlock_fair.
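A standalone sketch of handling the handoff case instead of asserting it unreachable; the types are simplified stand-ins for parking_lot_core's, and the function name is illustrative only:

```rust
// Simplified stand-ins for the parking_lot_core types involved.
#[derive(PartialEq, Eq)]
struct UnparkToken(usize);

const TOKEN_NORMAL: UnparkToken = UnparkToken(0);
const TOKEN_HANDOFF: UnparkToken = UnparkToken(1);

enum ParkResult {
    Unparked(UnparkToken),
    Invalid,
    TimedOut,
}

// After being requeued from the condvar queue onto the mutex queue, a fair
// unlock of that mutex can hand the lock to us directly, so the handoff
// token has to be handled rather than treated as unreachable.
fn relock_after_wait(result: ParkResult, lock_normally: impl FnOnce()) {
    match result {
        // The mutex was handed off to us already locked; nothing to do.
        ParkResult::Unparked(token) if token == TOKEN_HANDOFF => {}
        // Any other outcome: acquire the mutex through the normal path.
        _ => lock_normally(),
    }
}
```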
```rust
// thread directly without unlocking it.
pub(crate) const TOKEN_HANDOFF: UnparkToken = UnparkToken(1);

// UnparkToken used to indicate that the waiter should restore PARKED_BIT.
```
```diff
- // UnparkToken used to indicate that the waiter should restore PARKED_BIT.
+ // UnparkToken used to indicate that the waiter should restore PARKED_BIT and then continue attempting to acquire the mutex.
```
```rust
// This thread didn't sleep, so it can't be sure whether it's the last thread
// in the queue. Setting PARKED_BIT can lead to false wakeups, but false
// wakeups are good for throughput during high contention.
ParkResult::Invalid => extra_flags = PARKED_BIT,
```
I don't think this is correct: we should only set PARKED_BIT if we are required to by an UnparkToken or if the current thread is about to park. Otherwise this just causes unnecessary work.
What numbers do you get on the benchmark without this?
The point here is to ask for more wakeups. During high contention, random wakeups can keep more threads on the CPU. Because high contention means the lock is acquired and released very frequently, more on-CPU time means a higher chance of acquiring the lock. Leaving the CPU and then being scheduled back one by one is very slow; we should only do that when there is probably no way to make progress anytime soon.
This is also why I named the new argument in_contention instead of set_parked_bit: the parked bit makes more sense as a contention signal than as an indication of parking.
When the thread count is more than 9, the numbers can be 30%~40% lower without setting the bit.
The main effect that setting the parked bit has is that it prevents threads from spinning (since we only spin when the parked bit is clear). This has the effect of causing threads to go directly to parking, which as you said is quite slow. However since other threads are no longer actively trying to acquire the lock, it means that one thread can quickly acquire and release the lock since there is no cache interference from other threads.
Although this may look good on benchmarks, it actually isn't good since other threads are wasting time doing work that isn't useful instead of attempting to acquire the lock. This is effectively equivalent to just pausing for a longer period between attempts to acquire the lock.
> This has the effect of causing threads to go directly to parking, which as you said is quite slow
Perf stats show that setting PARKED_BIT here leads to more context switches and a much higher cache-miss rate, which is proof that more threads are staying on the CPU instead of going to sleep.
The reason PARKED_BIT wakes more threads is that during contention some thread will acquire the lock without competing for it at all. For example, suppose thread A holds the lock and threads B and C are waiting for it. When A releases the lock and wakes thread B, another thread D that happens to be on the CPU right now may acquire the lock before B does. There are two possible behaviors for thread D: it acquires the lock directly, or it fails to acquire it, tries to park, and fails again due to validation. Setting the parked bit here exploits the second situation, so that when D acquires the lock, it will still wake thread C later.
> it actually isn't good since other threads are wasting time doing work that isn't useful instead of attempting to acquire the lock.
I noticed this performance pitfall when I tried to implement a linked list protected by a mutex, where the list is generally short but can occasionally grow very long. After this PR, there is no obvious performance difference between pthread and parking_lot.
In that situation, it's not thread D's job to set the parked bit: thread B will set it before parking itself.
Thread B may not, as it may still be spinning trying to acquire the lock.
Benchmarks:
Running `cargo run --bin mutex --release -- 9:36:9 5 5 2 2` with 9, 18, 27, and 36 threads.
Running `cargo run --bin rwlock --release -- 36 9 5 5 2 2`.
Using lock-bench, `cargo run --release 32 2 10000 100`.