
Conversation

@BusyJay commented on May 8, 2025

This is an alternative, more aggressive implementation of the idea in #461.

Compared to #461, this PR

  • maintains the parked bit on the waiter side, so the waker doesn't have to perform two atomic operations.
  • resets the entire lock state back to 0 on unlock, which makes the fast-path lock more likely to succeed during high contention (see the sketch below).
  • sets PARKED_BIT even when a waiter is prevented from sleeping, so that more threads can be woken up during contention to compete for progress.
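
To make the second point concrete, here is a minimal, self-contained sketch of the unlock-reset idea. This is not the PR's actual code: the type `RawMutexSketch` and its method names are invented for illustration, and the real parking/unparking machinery is omitted.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const LOCKED_BIT: u8 = 0b01;
const PARKED_BIT: u8 = 0b10;

struct RawMutexSketch {
    state: AtomicU8,
}

impl RawMutexSketch {
    // Fast path: only succeeds when the state word is exactly 0.
    fn try_lock_fast(&self) -> bool {
        self.state
            .compare_exchange(0, LOCKED_BIT, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    // Reset *all* bits to 0 rather than only clearing LOCKED_BIT. The old
    // value tells us whether a waiter still needs to be woken; the woken
    // waiter is then responsible for restoring PARKED_BIT itself.
    fn unlock(&self) {
        let old = self.state.swap(0, Ordering::Release);
        if old & PARKED_BIT != 0 {
            // A real implementation would unpark one waiting thread here.
        }
    }
}

fn main() {
    let m = RawMutexSketch { state: AtomicU8::new(0) };
    assert!(m.try_lock_fast());
    m.unlock();
    // Because unlock wrote 0, the fast path can succeed again immediately.
    assert!(m.try_lock_fast());
}
```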

Running `cargo run --bin mutex --release -- 9:36:9 5 5 2 2`:

Running with 9 threads

| name | average | median | std.dev. |
| --- | --- | --- | --- |
| parking_lot::Mutex (this pr) | 472.082 kHz | 475.676 kHz | 13.820 kHz |
| parking_lot::Mutex (pr 461) | 405.134 kHz | 406.469 kHz | 9.105 kHz |
| parking_lot::Mutex (master) | 364.841 kHz | 365.127 kHz | 11.269 kHz |
| std::sync::Mutex | 769.754 kHz | 767.714 kHz | 23.908 kHz |
| pthread_mutex_t | 982.966 kHz | 989.991 kHz | 31.816 kHz |

Running with 18 threads

| name | average | median | std.dev. |
| --- | --- | --- | --- |
| parking_lot::Mutex (this pr) | 352.643 kHz | 353.062 kHz | 7.842 kHz |
| parking_lot::Mutex (pr 461) | 268.530 kHz | 268.355 kHz | 5.586 kHz |
| parking_lot::Mutex (master) | 82.786 kHz | 82.975 kHz | 2.435 kHz |
| std::sync::Mutex | 389.199 kHz | 394.549 kHz | 21.358 kHz |
| pthread_mutex_t | 482.005 kHz | 489.225 kHz | 26.404 kHz |

Running with 27 threads

| name | average | median | std.dev. |
| --- | --- | --- | --- |
| parking_lot::Mutex (this pr) | 231.691 kHz | 231.891 kHz | 3.014 kHz |
| parking_lot::Mutex (pr 461) | 185.802 kHz | 186.233 kHz | 2.598 kHz |
| parking_lot::Mutex (master) | 28.246 kHz | 28.306 kHz | 0.443 kHz |
| std::sync::Mutex | 280.553 kHz | 280.014 kHz | 10.115 kHz |
| pthread_mutex_t | 311.815 kHz | 311.582 kHz | 8.409 kHz |

Running with 36 threads

| name | average | median | std.dev. |
| --- | --- | --- | --- |
| parking_lot::Mutex (this pr) | 157.511 kHz | 157.183 kHz | 1.764 kHz |
| parking_lot::Mutex (pr 461) | 134.010 kHz | 133.784 kHz | 1.509 kHz |
| parking_lot::Mutex (master) | 22.055 kHz | 22.059 kHz | 0.150 kHz |
| std::sync::Mutex | 193.672 kHz | 195.078 kHz | 10.369 kHz |
| pthread_mutex_t | 224.436 kHz | 225.767 kHz | 12.334 kHz |

Running `cargo run --bin rwlock --release -- 36 9 5 5 2 2`:

| name | write | read |
| --- | --- | --- |
| parking_lot::RwLock (this pr) | 6805.309 kHz | 1334.018 kHz |
| parking_lot::RwLock (pr 461) | 6121.347 kHz | 968.373 kHz |
| parking_lot::RwLock (master) | 628.062 kHz | 954.938 kHz |
| seqlock::SeqLock | 648.979 kHz | 152225.000 kHz |
| pthread_rwlock_t | 1678.253 kHz | 376.558 kHz |

Using lock-bench, `cargo run --release 32 2 10000 100`:

| name | avg | min | max |
| --- | --- | --- | --- |
| std::sync::Mutex | 30.795793ms | 28.369313ms | 33.668656ms |
| parking_lot::Mutex (this pr) | 32.460918ms | 29.497656ms | 34.424228ms |
| parking_lot::Mutex (pr 461) | 40.800542ms | 37.16543ms | 44.621677ms |
| parking_lot::Mutex (master) | 206.836045ms | 183.902676ms | 213.697023ms |
| spin::Mutex | 63.898884ms | 58.45244ms | 74.323676ms |
| AmdSpinlock | 70.131547ms | 65.356139ms | 83.456119ms |

| name | avg | min | max |
| --- | --- | --- | --- |
| std::sync::Mutex | 30.52266ms | 28.69828ms | 34.945486ms |
| parking_lot::Mutex (this pr) | 31.257888ms | 29.648337ms | 33.955286ms |
| parking_lot::Mutex (pr 461) | 41.146074ms | 38.453175ms | 42.433051ms |
| parking_lot::Mutex (master) | 210.387478ms | 187.38791ms | 215.752182ms |
| spin::Mutex | 62.823716ms | 54.801191ms | 74.31628ms |
| AmdSpinlock | 68.937325ms | 55.406785ms | 80.83359ms |

@BusyJay changed the title from "even fast unlock in contention" to "even faster unlock in contention" on May 8, 2025
@Amanieu (Owner) left a comment

Thanks for this work, I really appreciate it!

I've only reviewed the RawMutex part so far, but it looks very promising.

```diff
 // If we are using a fair unlock then we should keep the
 // mutex locked and hand it off to the unparked thread.
-if result.unparked_threads != 0 && (force_fair || result.be_fair) {
+if result.unparked_threads != 0 && force_fair {
```
@Amanieu (Owner) commented

The logic here is very different in the unlock_fair case so it would be better to have a separate unlock_fair_slow method.
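
For readers following along, here is a rough sketch of what a dedicated `unlock_fair_slow` could look like with this PR's state layout. It is illustrative only, not the actual change: it assumes `parking_lot_core` as a dependency, writes the function as a free function over the state word, and assumes the token constants match the ones used in the crate.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use parking_lot_core::{UnparkResult, UnparkToken};

const LOCKED_BIT: u8 = 0b01;
const TOKEN_NORMAL: UnparkToken = UnparkToken(0);
const TOKEN_HANDOFF: UnparkToken = UnparkToken(1);

// A fair unlock never releases the lock when a waiter exists: the lock is
// handed off directly to the unparked thread instead.
fn unlock_fair_slow(state: &AtomicU8) {
    let addr = state as *const AtomicU8 as usize;
    let callback = |result: UnparkResult| {
        if result.unparked_threads != 0 {
            // Keep LOCKED_BIT set and hand the mutex to the woken thread.
            return TOKEN_HANDOFF;
        }
        // No waiters were found, so release the lock normally.
        state.store(0, Ordering::Release);
        TOKEN_NORMAL
    };
    unsafe {
        parking_lot_core::unpark_one(addr, callback);
    }
}

fn main() {
    // No threads are parked on `state`, so this simply releases the lock.
    let state = AtomicU8::new(LOCKED_BIT);
    unlock_fair_slow(&state);
    assert_eq!(state.load(Ordering::Relaxed), 0);
}
```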


```diff
 #[cold]
-fn lock_slow(&self, timeout: Option<Instant>) -> bool {
+fn lock_slow(&self, timeout: Option<Instant>, in_contention: bool) -> bool {
```
@Amanieu (Owner) commented

Suggested change:

```diff
-fn lock_slow(&self, timeout: Option<Instant>, in_contention: bool) -> bool {
+fn lock_slow(&self, timeout: Option<Instant>, set_parked_bit: bool) -> bool {
```

I think set_parked_bit is a clearer name for this.

```rust
state | LOCKED_BIT | extra_flags,
Ordering::Acquire,
Ordering::Relaxed,
) {
```
@Amanieu (Owner) commented

The spin loop on line 233 should be disabled if set_parked_bit since there are actually parked threads even though the bit isn't set.

```rust
} else {
    mutex.lock();
match result {
    ParkResult::Unparked(TOKEN_HANDOFF) => unreachable!("can't be handed off"),
```
@Amanieu (Owner) commented

TOKEN_HANDOFF is actually reachable if we are requeued onto a mutex and then another thread unlocks that mutex with unlock_fair.

```rust
// thread directly without unlocking it.
pub(crate) const TOKEN_HANDOFF: UnparkToken = UnparkToken(1);

// UnparkToken used to indicate that the waiter should restore PARKED_BIT.
```
@Amanieu (Owner) commented

Suggested change:

```diff
-// UnparkToken used to indicate that the waiter should restore PARKED_BIT.
+// UnparkToken used to indicate that the waiter should restore PARKED_BIT and then continue attempting to acquire the mutex.
```

```rust
// This thread doesn't sleep, so it's not sure whether it's the last thread
// in queue. Setting PARKED_BIT can lead to false wake up. But false wake up
// is good for throughput during high contention.
ParkResult::Invalid => extra_flags = PARKED_BIT,
```
@Amanieu (Owner) commented

I don't think this is correct: we should only set PARKED_BIT if we are required to by an UnparkToken or if the current thread is about to park. Otherwise this just causes unnecessary work.

What numbers do you get on the benchmark without this?

@BusyJay (Author) commented on May 26, 2025

The point here is to ask for more wake-ups. During high contention, extra wake-ups keep more threads on the CPU. Because a highly contended lock is acquired and released very frequently, more time on the CPU means a higher chance of acquiring the lock. Going off-CPU and then being scheduled back one by one is very slow, so we should only do that when there is probably no way to make progress anytime soon.

This is also why I named the new argument in_contention instead of set_parked_bit: the parked bit really signals contention rather than actual parking.

When the thread count is more than 9, the benchmark numbers can be 30% ~ 40% lower without setting the bit.

@Amanieu (Owner) commented

The main effect that setting the parked bit has is that it prevents threads from spinning (since we only spin when the parked bit is clear). This has the effect of causing threads to go directly to parking, which as you said is quite slow. However since other threads are no longer actively trying to acquire the lock, it means that one thread can quickly acquire and release the lock since there is no cache interference from other threads.

Although this may look good on benchmarks, it actually isn't good since other threads are wasting time doing work that isn't useful instead of attempting to acquire the lock. This is effectively equivalent to just pausing for a longer period between attempts to acquire the lock.
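
For context, a minimal illustration (not the real lock_slow) of the spin-gating behaviour being discussed: spinning is only attempted while PARKED_BIT is clear, so setting the bit pushes other threads toward parking rather than active contention. The function name and spin budget below are invented for the example.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const LOCKED_BIT: u8 = 0b01;
const PARKED_BIT: u8 = 0b10;

// Returns true if the lock was acquired, false if the caller should park.
fn try_lock_contended(state: &AtomicU8, max_spins: u32) -> bool {
    let mut spins = 0;
    loop {
        let s = state.load(Ordering::Relaxed);
        if s & LOCKED_BIT == 0 {
            if state
                .compare_exchange_weak(s, s | LOCKED_BIT, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                return true;
            }
            continue;
        }
        // Stop spinning as soon as PARKED_BIT is observed (or the budget runs out).
        if s & PARKED_BIT != 0 || spins >= max_spins {
            return false;
        }
        spins += 1;
        std::hint::spin_loop();
    }
}

fn main() {
    let state = AtomicU8::new(0);
    assert!(try_lock_contended(&state, 40));

    // With PARKED_BIT set on a held lock, the caller skips spinning entirely.
    state.store(LOCKED_BIT | PARKED_BIT, Ordering::Relaxed);
    assert!(!try_lock_contended(&state, 40));
}
```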

@BusyJay (Author) commented on May 27, 2025

> This has the effect of causing threads to go directly to parking, which as you said is quite slow

Perf stats show that setting PARKED_BIT here leads to more context switches and a much higher cache-miss rate, which is evidence that more threads are staying on the CPU instead of going to sleep.

The reason PARKED_BIT leads to more wake-ups is that during contention some thread can acquire the lock without competing at all. For example, suppose thread A holds the lock and threads B and C are waiting on it. When A releases the lock and wakes thread B, another thread D that happens to be on the CPU may acquire the lock before B does. Thread D can behave in two ways: it either acquires the lock directly, or it fails to acquire it, tries to park, and fails again due to validation. Setting the parked bit here exploits the second case, so that when D acquires the lock, it will still wake thread C when it unlocks.

> it actually isn't good since other threads are wasting time doing work that isn't useful instead of attempting to acquire the lock.

I noticed this performance pitfall while implementing a linked list protected by a mutex; the list is generally short but can occasionally grow very long. With this PR, there is no obvious performance difference between pthread and parking_lot.

@Amanieu (Owner) commented

In that situation, it's not thread D's job to set the parked bit: thread B will set it before parking itself.

@BusyJay (Author) commented

Thread B may not, as it may still be spinning trying to acquire the lock.
