even faster unlock in contention #462
Conversation
This is an alternative, more aggressive implementation of the idea in Amanieu#461. Compared to Amanieu#461, this PR:
- maintains the parked bit on the waiter side, so the waker doesn't have to perform two atomic operations;
- resets all lock state back to 0 on unlock, which makes the fast lock path more likely to succeed under high contention (see the sketch below);
- sets PARKED_BIT even when a waiter is prevented from sleeping, so that more threads can be woken up during contention to compete for progress.

Signed-off-by: Jay <[email protected]>
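As a rough illustration of the reset-to-0 unlock described above, here is a minimal standalone sketch, not the actual parking_lot code (names and structure are simplified assumptions): unlock swaps the entire state word back to 0 so a concurrent fast-path lock can succeed immediately, and only takes the slow wake-up path when the parked bit was observed.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const LOCKED_BIT: u8 = 0b01;
const PARKED_BIT: u8 = 0b10;

struct RawMutex {
    state: AtomicU8,
}

impl RawMutex {
    fn try_lock_fast(&self) -> bool {
        // Fast path: only succeeds when the state is exactly 0 (unlocked and
        // no parked bit), which is why resetting the whole word on unlock
        // helps under contention.
        self.state
            .compare_exchange_weak(0, LOCKED_BIT, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    fn unlock(&self) {
        // Reset all state bits to 0 in a single swap.
        let prev = self.state.swap(0, Ordering::Release);
        if prev & PARKED_BIT != 0 {
            // There may be parked threads; wake one of them up.
            // (The real crate does this through parking_lot_core; omitted here.)
            self.unlock_slow();
        }
    }

    #[cold]
    fn unlock_slow(&self) {
        // Placeholder for the wake-up logic.
    }
}
```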
Amanieu left a comment:
Thanks for this work, I really appreciate it!
I've only reviewed the RawMutex part so far, but it looks very promising.
```diff
  // If we are using a fair unlock then we should keep the
  // mutex locked and hand it off to the unparked thread.
- if result.unparked_threads != 0 && (force_fair || result.be_fair) {
+ if result.unparked_threads != 0 && force_fair {
```
The logic here is very different in the unlock_fair case so it would be better to have a separate unlock_fair_slow method.
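A hedged sketch of what that split could look like; the struct and method bodies below are placeholder assumptions, not the real implementation:

```rust
use std::sync::atomic::AtomicU8;

struct RawMutex {
    state: AtomicU8,
}

impl RawMutex {
    #[cold]
    fn unlock_slow(&self) {
        // Normal unlock: clear the lock state and wake one waiter, which
        // then re-acquires the lock through the usual fast or slow path.
    }

    #[cold]
    fn unlock_fair_slow(&self) {
        // Fair unlock: keep the mutex locked and hand it off directly to
        // the unparked thread (TOKEN_HANDOFF), so it never has to compete.
    }
}
```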
```diff
  #[cold]
- fn lock_slow(&self, timeout: Option<Instant>) -> bool {
+ fn lock_slow(&self, timeout: Option<Instant>, in_contention: bool) -> bool {
```
```diff
- fn lock_slow(&self, timeout: Option<Instant>, in_contention: bool) -> bool {
+ fn lock_slow(&self, timeout: Option<Instant>, set_parked_bit: bool) -> bool {
```
I think set_parked_bit is a clearer name for this.
```rust
    state | LOCKED_BIT | extra_flags,
    Ordering::Acquire,
    Ordering::Relaxed,
) {
```
The spin loop on line 233 should be disabled if set_parked_bit since there are actually parked threads even though the bit isn't set.
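Something like the following hypothetical helper (not code from this PR) captures the condition being asked for: spinning only happens when there is no evidence of parked threads, either from the state word or from the caller.

```rust
const PARKED_BIT: u8 = 0b10;

// Hypothetical helper: spin only while the parked bit is clear, the caller
// hasn't told us there are waiters, and we haven't spun too long already.
fn should_spin(set_parked_bit: bool, state: u8, spin_iterations: u32) -> bool {
    !set_parked_bit && (state & PARKED_BIT == 0) && spin_iterations < 10
}
```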
```diff
- } else {
-     mutex.lock();
+ match result {
+     ParkResult::Unparked(TOKEN_HANDOFF) => unreachable!("can't be handed off"),
```
TOKEN_HANDOFF is actually reachable if we are requeued onto a mutex and then another unlocks that mutex with unlock_fair.
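A standalone sketch of handling the handoff case instead of asserting it unreachable; the types are simplified stand-ins for parking_lot_core's, and the function name is illustrative only:

```rust
// Simplified stand-ins for the parking_lot_core types involved.
#[derive(PartialEq, Eq)]
struct UnparkToken(usize);

const TOKEN_NORMAL: UnparkToken = UnparkToken(0);
const TOKEN_HANDOFF: UnparkToken = UnparkToken(1);

enum ParkResult {
    Unparked(UnparkToken),
    Invalid,
    TimedOut,
}

// After being requeued from the condvar queue onto the mutex queue, a fair
// unlock of that mutex can hand the lock to us directly, so the handoff
// token has to be handled rather than treated as unreachable.
fn relock_after_wait(result: ParkResult, lock_normally: impl FnOnce()) {
    match result {
        // The mutex was handed off to us already locked; nothing to do.
        ParkResult::Unparked(token) if token == TOKEN_HANDOFF => {}
        // Any other outcome: acquire the mutex through the normal path.
        _ => lock_normally(),
    }
}
```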
```rust
// thread directly without unlocking it.
pub(crate) const TOKEN_HANDOFF: UnparkToken = UnparkToken(1);

// UnparkToken used to indicate that the waiter should restore PARKED_BIT.
```
```diff
- // UnparkToken used to indicate that the waiter should restore PARKED_BIT.
+ // UnparkToken used to indicate that the waiter should restore PARKED_BIT and then continue attempting to acquire the mutex.
```
```rust
// This thread didn't sleep, so it can't be sure whether it's the last thread
// in the queue. Setting PARKED_BIT can lead to false wakeups, but false
// wakeups are good for throughput during high contention.
ParkResult::Invalid => extra_flags = PARKED_BIT,
```
I don't think this is correct: we should only set PARKED_BIT if we are required to by an UnparkToken or if the current thread is about to park. Otherwise this just causes unnecessary work.
What numbers do you get on the benchmark without this?
The point here is to ask for more wakeups. During high contention, random wakeups can keep more threads on the CPU. Because high contention means the lock is acquired and released very frequently, more on-CPU time means a higher chance of acquiring the lock. Leaving the CPU and then being scheduled back one by one is very slow; we should only do that when there is probably no way to make progress anytime soon.
This is also why I named the new argument in_contention instead of set_parked_bit: the parked bit makes more sense as a contention signal than as an indication of parking.
When the thread count is more than 9, the numbers can be 30%~40% lower without setting the bit.
The main effect that setting the parked bit has is that it prevents threads from spinning (since we only spin when the parked bit is clear). This has the effect of causing threads to go directly to parking, which as you said is quite slow. However since other threads are no longer actively trying to acquire the lock, it means that one thread can quickly acquire and release the lock since there is no cache interference from other threads.
Although this may look good on benchmarks, it actually isn't good since other threads are wasting time doing work that isn't useful instead of attempting to acquire the lock. This is effectively equivalent to just pausing for a longer period between attempts to acquire the lock.
> This has the effect of causing threads to go directly to parking, which as you said is quite slow
Perf stats show that setting PARKED_BIT here leads to more context switches and a much higher cache-miss rate, which is proof that more threads are staying on the CPU instead of going to sleep.
The reason PARKED_BIT wakes more threads is that during contention some thread will acquire the lock without competing for it at all. For example, suppose thread A holds the lock and threads B and C are waiting for it. When A releases the lock and wakes thread B, another thread D that happens to be on the CPU right now may acquire the lock before B does. There are two possible behaviors for thread D: it acquires the lock directly, or it fails to acquire it, tries to park, and fails again due to validation. Setting the parked bit here exploits the second situation, so that when D acquires the lock, it will still wake thread C later.
> it actually isn't good since other threads are wasting time doing work that isn't useful instead of attempting to acquire the lock.
I noticed this performance pitfall when I tried to implement a linked list protected by a mutex, where the list is generally short but can occasionally grow very long. After this PR, there is no obvious performance difference between pthread and parking_lot.
In that situation, it's not thread D's job to set the parked bit: thread B will set it before parking itself.
Thread B may not, as it may still be spinning trying to acquire the lock.
Benchmarks:
Running `cargo run --bin mutex --release -- 9:36:9 5 5 2 2` with 9, 18, 27, and 36 threads.
Running `cargo run --bin rwlock --release -- 36 9 5 5 2 2`.
Using lock-bench, `cargo run --release 32 2 10000 100`.