Optimize is_ascii for str and [u8] further #130733
base: master
Conversation
I tried to rename the benchmark file to …

Are you taking into account that this can cause contention over a single execution port, whereas the current implementation may be able to execute multiple checks at once in the same cycles, despite having more apparent instructions?

Apparently some more recent CPUs have more ports for `pmovmskb`, interestingly, so they should indeed go significantly faster.
library/core/src/slice/ascii.rs
Outdated
// Process the remaining `bytes.len() % N` bytes.
let mut is_ascii = true;
while i < bytes.len() {
    is_ascii &= bytes[i] <= 127;
On the godbolt link it looks like the compiler tries to unroll and autovectorize this loop again, which might not be beneficial for usually low loop counts. The code size would be much smaller if the remainder loop has a simple early exit, but I haven't run the benchmarks on that:
// Process the remaining `bytes.len() % N` bytes.
while i < bytes.len() {
    if bytes[i] > 127 {
        return false;
    }
    // NB: as posted, this suggestion omitted the `i += 1;` needed here
    // (see the follow-up below).
}
true
If that turns out to be a slowdown, another thing to try could be a second loop with a smaller chunk size:
// Constant chosen to enable `pmovmskb` instruction on x86-64
const N1: usize = 32;
const N2: usize = 8; // or 16
let mut i = 0;
while i + N1 <= bytes.len() {
    let chunk_end = i + N1;
    let mut count = 0;
    while i < chunk_end {
        count += (bytes[i] <= 127) as u8;
        i += 1;
    }
    // All bytes should be <= 127, so count equals the chunk size.
    if count != N1 as u8 {
        return false;
    }
}
while i + N2 <= bytes.len() {
    let chunk_end = i + N2;
    let mut count = 0;
    while i < chunk_end {
        count += (bytes[i] <= 127) as u8;
        i += 1;
    }
    // All bytes should be <= 127, so count equals the chunk size.
    if count != N2 as u8 {
        return false;
    }
}
// Process the remaining `bytes.len() % N2` bytes.
while i < bytes.len() {
    if bytes[i] > 127 {
        return false;
    }
    i += 1;
}
true
> On the godbolt link it looks like the compiler tries to unroll and autovectorize this loop again, which might not be beneficial for usually low loop counts.
I reported this when I had the time: #130795
> The code size would be much smaller if the remainder loop has a simple early exit, but I haven't run the benchmarks on that:
This ended up being a slowdown. Also note it's missing the `i += 1;`, for anyone wanting to experiment with this; I made that mistake a few times while refactoring and trying out different loops.
I tried `slice::from_raw_parts` on the tail to use the `is_ascii_simple` function above, and that was slower too. It seemed faster to just eat the extra unrolled code, since I couldn't get a scalar remainder-loop construction that was as fast.
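For reference, that rejected attempt looks roughly like this (a sketch only: `check_tail` is a made-up name, and safe re-slicing stands in for the `from_raw_parts` call):

fn is_ascii_simple(bytes: &[u8]) -> bool {
    bytes.iter().all(|&b| b <= 127)
}

fn check_tail(bytes: &[u8], i: usize) -> bool {
    // Equivalent to slice::from_raw_parts(bytes.as_ptr().add(i), bytes.len() - i),
    // but without the unsafe block.
    is_ascii_simple(&bytes[i..])
}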
> If that turns out to be a slowdown, another thing to try could be a second loop with a smaller chunk size:
I tried a few permutations of that, and reductions of the constant like 16, 8, 4, 2, 1. The second loop of 8 seemed to perform the best on the benchmark suite, but only performed markedly better on the `unaligned_both_medium` and `unaligned_tail_medium` cases.
ascii::is_ascii_slice::long::case00_libcore 24.39/iter +/- 1.16
ascii::is_ascii_slice::long::case04_while_loop_32 8.14/iter +/- 0.80
ascii::is_ascii_slice::long::case05_while_loop_32_8 9.28/iter +/- 1.17
ascii::is_ascii_slice::medium::case00_libcore 2.93/iter +/- 0.06
ascii::is_ascii_slice::medium::case04_while_loop_32 1.64/iter +/- 0.30
ascii::is_ascii_slice::medium::case05_while_loop_32_8 1.86/iter +/- 0.09
ascii::is_ascii_slice::short::case00_libcore 5.34/iter +/- 0.99
ascii::is_ascii_slice::short::case04_while_loop_32 3.91/iter +/- 0.13
ascii::is_ascii_slice::short::case05_while_loop_32_8 3.82/iter +/- 0.11
ascii::is_ascii_slice::unaligned_both_long::case00_libcore 23.13/iter +/- 1.18
ascii::is_ascii_slice::unaligned_both_long::case04_while_loop_32 5.29/iter +/- 0.26
ascii::is_ascii_slice::unaligned_both_long::case05_while_loop_32_8 5.31/iter +/- 0.18
ascii::is_ascii_slice::unaligned_both_medium::case00_libcore 3.04/iter +/- 0.35
ascii::is_ascii_slice::unaligned_both_medium::case04_while_loop_32 2.78/iter +/- 0.08
ascii::is_ascii_slice::unaligned_both_medium::case05_while_loop_32_8 1.62/iter +/- 0.10
ascii::is_ascii_slice::unaligned_head_long::case00_libcore 23.03/iter +/- 2.51
ascii::is_ascii_slice::unaligned_head_long::case04_while_loop_32 8.04/iter +/- 0.24
ascii::is_ascii_slice::unaligned_head_long::case05_while_loop_32_8 13.04/iter +/- 0.34
ascii::is_ascii_slice::unaligned_head_medium::case00_libcore 3.78/iter +/- 0.52
ascii::is_ascii_slice::unaligned_head_medium::case04_while_loop_32 6.86/iter +/- 0.61
ascii::is_ascii_slice::unaligned_head_medium::case05_while_loop_32_8 8.39/iter +/- 0.16
ascii::is_ascii_slice::unaligned_tail_long::case00_libcore 24.26/iter +/- 0.22
ascii::is_ascii_slice::unaligned_tail_long::case04_while_loop_32 4.54/iter +/- 0.15
ascii::is_ascii_slice::unaligned_tail_long::case05_while_loop_32_8 4.69/iter +/- 0.05
ascii::is_ascii_slice::unaligned_tail_medium::case00_libcore 2.89/iter +/- 0.40
ascii::is_ascii_slice::unaligned_tail_medium::case04_while_loop_32 2.63/iter +/- 0.09
ascii::is_ascii_slice::unaligned_tail_medium::case05_while_loop_32_8 1.98/iter +/- 0.13
The old code operates on `usize`s to at least get SWAR (SIMD within a register) on targets without SIMD, even on those where unaligned loads are more expensive. So this needs to be checked on more than x86.
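(For context, the `usize`-based SWAR test works along these lines; a minimal sketch, not the exact core source:)

// 0x80 in every byte of the word.
const NONASCII_MASK: usize = usize::from_ne_bytes([0x80; core::mem::size_of::<usize>()]);

// One usize load checks size_of::<usize>() bytes at once: any byte
// with its high bit set (i.e. > 127) survives the mask.
fn contains_nonascii(word: usize) -> bool {
    (word & NONASCII_MASK) != 0
}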
I've reinstated the SWAR loop. I know I tried integrating this fast path into the current function and then using:

fn is_ascii(mut s: &[u8]) -> bool {
    #[cfg(...)]
    {
        let mut i = 0;
        while i + 32 <= bytes.len() {}
        let new_len = s.len() - i;
        s = crate::slice::from_raw_parts(s.as_ptr().add(i), new_len);
    }
    // proceed with SWAR
}
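A filled-in version of that sketch, with the elided details assumed (the sse2 gate from the commit message, the 32-byte count loop from the review thread above, and safe re-slicing in place of `from_raw_parts`):

fn is_ascii(mut s: &[u8]) -> bool {
    #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
    {
        let mut i = 0;
        while i + 32 <= s.len() {
            let chunk_end = i + 32;
            let mut count = 0u8;
            while i < chunk_end {
                count += (s[i] <= 127) as u8;
                i += 1;
            }
            // All 32 bytes must have been <= 127.
            if count != 32 {
                return false;
            }
        }
        // Hand the remaining tail to the SWAR path.
        s = &s[i..];
    }
    // Stand-in for the SWAR loop: scalar check of whatever is left.
    s.iter().all(|&b| b <= 127)
}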
If the speedup requires platform-specific code, or if shaping the code to benefit x86 would negatively impact other platforms, then that's acceptable, yes.
☔ The latest upstream changes (presumably #132586) made this pull request unmergeable. Please resolve the merge conflicts.
Add LONG benchmarks for more comparison between the methods
The new `is_ascii` function is optimized to use the `pmovmskb` vector instruction, which tests the high bit in each lane. This corresponds to the same check of whether a byte is ASCII, so ASCII validity checking can be vectorized. This instruction does not exist on other platforms, so it is likely to regress performance there and is gated to `all(target_arch = "x86_64", target_feature = "sse2")`.

Add codegen test

Remove crate::mem import for functions included in the prelude
Replace the existing optimized function with one that enables auto-vectorization. This is especially beneficial on x86-64, as `pmovmskb` can be emitted with careful structuring of the code. The instruction can detect non-ASCII characters one vector register width at a time, instead of the current `usize`-at-a-time check. The resulting implementation is completely safe.
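For illustration, the shape the codegen is steered toward is equivalent to this explicit-intrinsics version (an illustrative sketch, not the PR's code, which stays in safe Rust; `is_ascii_sse2` is a made-up name):

#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};
    let mut i = 0;
    while i + 16 <= bytes.len() {
        // SAFETY: the loop condition guarantees 16 readable bytes at offset i.
        let chunk = unsafe { _mm_loadu_si128(bytes.as_ptr().add(i) as *const __m128i) };
        // pmovmskb gathers the high bit of each of the 16 lanes; any set
        // bit means a byte > 127 in this chunk.
        if unsafe { _mm_movemask_epi8(chunk) } != 0 {
            return false;
        }
        i += 16;
    }
    bytes[i..].iter().all(|&b| b <= 127)
}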
`case00_libcore` is the current implementation, `case04_while_loop` is this PR. `unaligned_head_medium` is the main regression in the benchmarks; it is a 32-byte string being sliced `bytes[1..]`. The first commit can be used to run the benchmarks against the current core implementation.
The previous implementation was done in #74066.
Two potential drawbacks of this implementation are that it increases instruction count and may regress other platforms/architectures. The benches here may also be too artificial to glean much insight from.
https://rust.godbolt.org/z/G9znGfY36