Optimize is_ascii for str and [u8] further #130733
base: master
Conversation
I tried to rename the benchmark file to …

Are you taking into account that this can cause contention over a single execution port, whereas the current implementation may be able to execute multiple checks at once in the same cycles, despite having more apparent instructions?

Apparently some more recent CPUs have more ports for `pmovmskb`, interestingly, so they should indeed go significantly faster.
library/core/src/slice/ascii.rs
Outdated
// Process the remaining `bytes.len() % N` bytes.
let mut is_ascii = true;
while i < bytes.len() {
    is_ascii &= bytes[i] <= 127;
On the godbolt link it looks like the compiler tries to unroll and autovectorize this loop again, which might not be beneficial for usually low loop counts. The code size would be much smaller if the remainder loop has a simple early exit, but I haven't run the benchmarks on that:
// Process the remaining `bytes.len() % N` bytes.
while i < bytes.len() {
    if bytes[i] > 127 {
        return false;
    }
    // NB: as posted, this suggestion omitted the `i += 1;` needed here
    // (see the follow-up below).
}
true
If that turns out to be a slowdown, another thing to try could be a second loop with a smaller chunk size:
// Constant chosen to enable `pmovmskb` instruction on x86-64
const N1: usize = 32;
const N2: usize = 8; // or 16
let mut i = 0;
while i + N1 <= bytes.len() {
    let chunk_end = i + N1;
    let mut count = 0;
    while i < chunk_end {
        count += (bytes[i] <= 127) as u8;
        i += 1;
    }
    // All bytes should be <= 127, so count equals the chunk size.
    if count != N1 as u8 {
        return false;
    }
}
while i + N2 <= bytes.len() {
    let chunk_end = i + N2;
    let mut count = 0;
    while i < chunk_end {
        count += (bytes[i] <= 127) as u8;
        i += 1;
    }
    // All bytes should be <= 127, so count equals the chunk size.
    if count != N2 as u8 {
        return false;
    }
}
// Process the remaining `bytes.len() % N2` bytes.
while i < bytes.len() {
    if bytes[i] > 127 {
        return false;
    }
    i += 1;
}
true
> On the godbolt link it looks like the compiler tries to unroll and autovectorize this loop again, which might not be beneficial for usually low loop counts.
I reported this when I had the time: #130795
> The code size would be much smaller if the remainder loop has a simple early exit, but I haven't run the benchmarks on that:
This ended up being a slowdown. Also note it's missing the `i += 1;`, for anyone wanting to experiment with this; I made that mistake a few times while refactoring and trying out different loops.
I tried `slice::from_raw_parts` on the tail to use the `is_ascii_simple` function above, and that was slower too. It seemed faster to just eat the extra unrolled code, since I couldn't get a scalar remainder-loop construction that was as fast.
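For reference, that rejected attempt looks roughly like this (a sketch only: `check_tail` is a made-up name, and safe re-slicing stands in for the `from_raw_parts` call):

fn is_ascii_simple(bytes: &[u8]) -> bool {
    bytes.iter().all(|&b| b <= 127)
}

fn check_tail(bytes: &[u8], i: usize) -> bool {
    // Equivalent to slice::from_raw_parts(bytes.as_ptr().add(i), bytes.len() - i),
    // but without the unsafe block.
    is_ascii_simple(&bytes[i..])
}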
> If that turns out to be a slowdown, another thing to try could be a second loop with a smaller chunk size:
I tried a few permutations of that, and reductions of the constant like 16, 8, 4, 2, 1. The second loop of 8 seemed to perform the best on the benchmark suite, but only performed markedly better on the `unaligned_both_medium` and `unaligned_tail_medium` cases.
ascii::is_ascii_slice::long::case00_libcore 24.39/iter +/- 1.16
ascii::is_ascii_slice::long::case04_while_loop_32 8.14/iter +/- 0.80
ascii::is_ascii_slice::long::case05_while_loop_32_8 9.28/iter +/- 1.17
ascii::is_ascii_slice::medium::case00_libcore 2.93/iter +/- 0.06
ascii::is_ascii_slice::medium::case04_while_loop_32 1.64/iter +/- 0.30
ascii::is_ascii_slice::medium::case05_while_loop_32_8 1.86/iter +/- 0.09
ascii::is_ascii_slice::short::case00_libcore 5.34/iter +/- 0.99
ascii::is_ascii_slice::short::case04_while_loop_32 3.91/iter +/- 0.13
ascii::is_ascii_slice::short::case05_while_loop_32_8 3.82/iter +/- 0.11
ascii::is_ascii_slice::unaligned_both_long::case00_libcore 23.13/iter +/- 1.18
ascii::is_ascii_slice::unaligned_both_long::case04_while_loop_32 5.29/iter +/- 0.26
ascii::is_ascii_slice::unaligned_both_long::case05_while_loop_32_8 5.31/iter +/- 0.18
ascii::is_ascii_slice::unaligned_both_medium::case00_libcore 3.04/iter +/- 0.35
ascii::is_ascii_slice::unaligned_both_medium::case04_while_loop_32 2.78/iter +/- 0.08
ascii::is_ascii_slice::unaligned_both_medium::case05_while_loop_32_8 1.62/iter +/- 0.10
ascii::is_ascii_slice::unaligned_head_long::case00_libcore 23.03/iter +/- 2.51
ascii::is_ascii_slice::unaligned_head_long::case04_while_loop_32 8.04/iter +/- 0.24
ascii::is_ascii_slice::unaligned_head_long::case05_while_loop_32_8 13.04/iter +/- 0.34
ascii::is_ascii_slice::unaligned_head_medium::case00_libcore 3.78/iter +/- 0.52
ascii::is_ascii_slice::unaligned_head_medium::case04_while_loop_32 6.86/iter +/- 0.61
ascii::is_ascii_slice::unaligned_head_medium::case05_while_loop_32_8 8.39/iter +/- 0.16
ascii::is_ascii_slice::unaligned_tail_long::case00_libcore 24.26/iter +/- 0.22
ascii::is_ascii_slice::unaligned_tail_long::case04_while_loop_32 4.54/iter +/- 0.15
ascii::is_ascii_slice::unaligned_tail_long::case05_while_loop_32_8 4.69/iter +/- 0.05
ascii::is_ascii_slice::unaligned_tail_medium::case00_libcore 2.89/iter +/- 0.40
ascii::is_ascii_slice::unaligned_tail_medium::case04_while_loop_32 2.63/iter +/- 0.09
ascii::is_ascii_slice::unaligned_tail_medium::case05_while_loop_32_8 1.98/iter +/- 0.13
The old code operates on `usize`s to at least get SWAR (SIMD within a register) on targets without SIMD, even on those where unaligned loads are more expensive. So this needs to be checked on more than x86.
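(For context, the `usize`-based SWAR test works along these lines; a minimal sketch, not the exact core source:)

// 0x80 in every byte of the word.
const NONASCII_MASK: usize = usize::from_ne_bytes([0x80; core::mem::size_of::<usize>()]);

// One usize load checks size_of::<usize>() bytes at once: any byte
// with its high bit set (i.e. > 127) survives the mask.
fn contains_nonascii(word: usize) -> bool {
    (word & NONASCII_MASK) != 0
}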
I've reinstated the SWAR loop. I know I tried integrating this fast path into the current function and then using:

fn is_ascii(mut s: &[u8]) -> bool {
    #[cfg(...)]
    {
        let mut i = 0;
        while i + 32 <= bytes.len() {}
        let new_len = s.len() - i;
        s = crate::slice::from_raw_parts(s.as_ptr().add(i), new_len);
    }
    // proceed with SWAR
}
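A filled-in version of that sketch, with the elided details assumed (the sse2 gate from the commit message, the 32-byte count loop from the review thread above, and safe re-slicing in place of `from_raw_parts`):

fn is_ascii(mut s: &[u8]) -> bool {
    #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
    {
        let mut i = 0;
        while i + 32 <= s.len() {
            let chunk_end = i + 32;
            let mut count = 0u8;
            while i < chunk_end {
                count += (s[i] <= 127) as u8;
                i += 1;
            }
            // All 32 bytes must have been <= 127.
            if count != 32 {
                return false;
            }
        }
        // Hand the remaining tail to the SWAR path.
        s = &s[i..];
    }
    // Stand-in for the SWAR loop: scalar check of whatever is left.
    s.iter().all(|&b| b <= 127)
}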
If the speedup requires platform-specific code, or if shaping the code to benefit x86 would negatively impact other platforms, then that's acceptable, yes.
☔ The latest upstream changes (presumably #132586) made this pull request unmergeable. Please resolve the merge conflicts.
Add LONG benchmarks for more comparison between the methods
The new `is_ascii` function is optimized to use the `pmovmskb` vector instruction, which tests the high bit in each lane. This corresponds to the same check of whether a byte is ASCII, so ASCII validity checking can be vectorized. This instruction does not exist on other platforms, so it is likely to regress performance there and is gated to `all(target_arch = "x86_64", target_feature = "sse2")`.

Add codegen test

Remove crate::mem import for functions included in the prelude
Replace the existing optimized function with one that enables auto-vectorization. This is especially beneficial on x86-64, as `pmovmskb` can be emitted with careful structuring of the code. The instruction can detect non-ASCII characters one vector register width at a time, instead of the current `usize`-at-a-time check. The resulting implementation is completely safe.
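For illustration, the shape the codegen is steered toward is equivalent to this explicit-intrinsics version (an illustrative sketch, not the PR's code, which stays in safe Rust; `is_ascii_sse2` is a made-up name):

#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};
    let mut i = 0;
    while i + 16 <= bytes.len() {
        // SAFETY: the loop condition guarantees 16 readable bytes at offset i.
        let chunk = unsafe { _mm_loadu_si128(bytes.as_ptr().add(i) as *const __m128i) };
        // pmovmskb gathers the high bit of each of the 16 lanes; any set
        // bit means a byte > 127 in this chunk.
        if unsafe { _mm_movemask_epi8(chunk) } != 0 {
            return false;
        }
        i += 16;
    }
    bytes[i..].iter().all(|&b| b <= 127)
}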
`case00_libcore` is the current implementation, `case04_while_loop` is this PR. `unaligned_head_medium` is the main regression in the benchmarks; it is a 32-byte string being sliced `bytes[1..]`. The first commit can be used to run the benchmarks against the current core implementation.
The previous implementation was done in #74066.
Two potential drawbacks of this implementation are that it increases instruction count and may regress other platforms/architectures. The benches here may also be too artificial to glean much insight from.
https://rust.godbolt.org/z/G9znGfY36