
Optimize is_ascii for str and [u8] further #130733

Open · wants to merge 2 commits into master
Conversation

@okaneco (Contributor) commented Sep 23, 2024

Replace the existing optimized function with one that enables auto-vectorization.

This is especially beneficial on x86-64, as `pmovmskb` can be emitted with careful structuring of the code. The instruction can detect non-ASCII bytes one vector register width at a time instead of the current usize-at-a-time check.

The resulting implementation is completely safe.
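
For illustration, the loop shape that auto-vectorizes looks roughly like the following; this is a minimal sketch with an assumed 32-byte chunk size and illustrative names, not the exact code in this PR:

    // Chunked, branch-free inner check that LLVM can lower to a vector
    // compare plus a movemask (`pmovmskb`) on x86-64.
    fn is_ascii_chunked(bytes: &[u8]) -> bool {
        const N: usize = 32; // assumed chunk size

        let mut i = 0;
        while i + N <= bytes.len() {
            // Accumulate the per-byte checks without branching inside the
            // chunk, so the compiler can vectorize the comparison and then
            // reduce the result once per chunk.
            let mut all_ascii = true;
            for &b in &bytes[i..i + N] {
                all_ascii &= b <= 127;
            }
            if !all_ascii {
                return false;
            }
            i += N;
        }

        // Scalar check of the remaining `bytes.len() % N` bytes.
        bytes[i..].iter().all(|&b| b <= 127)
    }

The key point is that the inner chunk has a fixed length and no early exit, which is what lets the compiler turn the per-byte comparisons into a single vector compare per chunk.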

case00_libcore is the current implementation; case04_while_loop is this PR.

benchmarks:
    ascii::is_ascii_slice::long::case00_libcore                             22.25/iter  +/- 1.09
    ascii::is_ascii_slice::long::case04_while_loop                           6.78/iter  +/- 0.92
    ascii::is_ascii_slice::medium::case00_libcore                            2.81/iter  +/- 0.39
    ascii::is_ascii_slice::medium::case04_while_loop                         1.56/iter  +/- 0.78
    ascii::is_ascii_slice::short::case00_libcore                             5.55/iter  +/- 0.85
    ascii::is_ascii_slice::short::case04_while_loop                          3.75/iter  +/- 0.22
    ascii::is_ascii_slice::unaligned_both_long::case00_libcore              26.59/iter  +/- 0.66
    ascii::is_ascii_slice::unaligned_both_long::case04_while_loop            5.78/iter  +/- 0.16
    ascii::is_ascii_slice::unaligned_both_medium::case00_libcore             2.97/iter  +/- 0.32
    ascii::is_ascii_slice::unaligned_both_medium::case04_while_loop          2.41/iter  +/- 0.10
    ascii::is_ascii_slice::unaligned_head_long::case00_libcore              23.71/iter  +/- 0.79
    ascii::is_ascii_slice::unaligned_head_long::case04_while_loop            7.83/iter  +/- 1.31
    ascii::is_ascii_slice::unaligned_head_medium::case00_libcore             3.69/iter  +/- 0.54
    ascii::is_ascii_slice::unaligned_head_medium::case04_while_loop          7.05/iter  +/- 0.32
    ascii::is_ascii_slice::unaligned_tail_long::case00_libcore              24.44/iter  +/- 1.41
    ascii::is_ascii_slice::unaligned_tail_long::case04_while_loop            5.12/iter  +/- 0.18
    ascii::is_ascii_slice::unaligned_tail_medium::case00_libcore             3.24/iter  +/- 0.40
    ascii::is_ascii_slice::unaligned_tail_medium::case04_while_loop          2.86/iter  +/- 0.14

unaligned_head_medium is the main regression in the benchmarks. It is a 32-byte string sliced as bytes[1..].
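
For reference, a hypothetical reconstruction of that benchmark input (the buffer contents and names are assumed, not taken from the benchmark source):

    fn main() {
        let medium = [b'a'; 32];            // a 32-byte ASCII buffer
        let unaligned_head = &medium[1..];  // drop the first byte, leaving an unaligned head
        assert!(unaligned_head.is_ascii());
    }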

The first commit can be used to run the benchmarks against the current core implementation.

The previous implementation was done in #74066.


Two potential drawbacks of this implementation are that it increases instruction count and may regress performance on other platforms/architectures. The benchmarks here may also be too artificial to glean much insight from.
https://rust.godbolt.org/z/G9znGfY36

@rustbot (Collaborator) commented Sep 23, 2024

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot added labels Sep 23, 2024: S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties), T-libs (Relevant to the library team, which will review and decide on the PR/issue).
@okaneco (Contributor, Author) commented Sep 23, 2024

I tried to rename the benchmark file to is_ascii_slice.rs, but GitHub didn't like that and flagged a merge conflict. I didn't know how to resolve that, so I dropped the commit.

@workingjubilee (Member) commented:
> This is especially beneficial on x86-64, as `pmovmskb` can be emitted with careful structuring of the code. The instruction can detect non-ASCII bytes one vector register width at a time instead of the current usize-at-a-time check.

Are you taking into account that this can cause contention over a single execution port, whereas the current implementation may be able to execute multiple checks at once in the same cycles despite having more apparent instructions?

@workingjubilee (Member) commented:
Apparently some more recent CPUs have more ports for pmovmskb, interestingly, so they should indeed go significantly faster.

A contributor left a review comment on this part of the diff:

    // Process the remaining `bytes.len() % N` bytes.
    let mut is_ascii = true;
    while i < bytes.len() {
        is_ascii &= bytes[i] <= 127;

On the godbolt link it looks like the compiler tries to unroll and autovectorize this loop again, which might not be beneficial for usually low loop counts. The code size would be much smaller if the remainder loop has a simple early exit, but I haven't run the benchmarks on that:

    // Process the remaining `bytes.len() % N` bytes.
    while i < bytes.len() {
        if bytes[i] > 127 {
            return false;
        }
    }

    true

If that turns out to be a slowdown, another thing to try could be a second loop with a smaller chunk size:

    // Constant chosen to enable `pmovmskb` instruction on x86-64
    const N1: usize = 32;
    const N2: usize = 8; // or 16

    let mut i = 0;

    while i + N1 <= bytes.len() {
        let chunk_end = i + N1;

        let mut count = 0;
        while i < chunk_end {
            count += (bytes[i] <= 127) as u8;
            i += 1;
        }

        // All bytes should be <= 127 so count is equal to chunk size.
        if count != N1 as u8 {
            return false;
        }
    }

    while i + N2 <= bytes.len() {
        let chunk_end = i + N2;

        let mut count = 0;
        while i < chunk_end {
            count += (bytes[i] <= 127) as u8;
            i += 1;
        }

        // All bytes should be <= 127 so count is equal to chunk size.
        if count != N2 as u8 {
            return false;
        }
    }

    // Process the remaining `bytes.len() % N` bytes.
    while i < bytes.len() {
        if bytes[i] > 127 {
            return false;
        }
    }

    true

@okaneco (Contributor, Author) replied:

> On the godbolt link it looks like the compiler tries to unroll and autovectorize this loop again, which might not be beneficial for usually low loop counts.

I reported this when I had the time, in #130795.

> The code size would be much smaller if the remainder loop has a simple early exit, but I haven't run the benchmarks on that:

This ended up being a slowdown. Also note that it's missing the `i += 1;`, for anyone wanting to experiment with it; I made that mistake a few times while refactoring and trying out different loops.

I tried slice::from_raw_parts on the tail so it could use the is_ascii_simple function above, and that was slower too. It seemed faster to just eat the extra unrolled code, since I couldn't construct a scalar remainder loop that was as fast.

> If that turns out to be a slowdown, another thing to try could be a second loop with a smaller chunk size:

I tried a few permutations of that, and smaller values for the constant like 16, 8, 4, 2, 1. The second loop with a chunk size of 8 seemed to perform the best on the benchmark suite, but was only markedly better on the unaligned_both_medium and unaligned_tail_medium cases.

ascii::is_ascii_slice::long::case00_libcore                          24.39/iter +/- 1.16
ascii::is_ascii_slice::long::case04_while_loop_32                     8.14/iter +/- 0.80
ascii::is_ascii_slice::long::case05_while_loop_32_8                   9.28/iter +/- 1.17
ascii::is_ascii_slice::medium::case00_libcore                         2.93/iter +/- 0.06
ascii::is_ascii_slice::medium::case04_while_loop_32                   1.64/iter +/- 0.30
ascii::is_ascii_slice::medium::case05_while_loop_32_8                 1.86/iter +/- 0.09
ascii::is_ascii_slice::short::case00_libcore                          5.34/iter +/- 0.99
ascii::is_ascii_slice::short::case04_while_loop_32                    3.91/iter +/- 0.13
ascii::is_ascii_slice::short::case05_while_loop_32_8                  3.82/iter +/- 0.11
ascii::is_ascii_slice::unaligned_both_long::case00_libcore           23.13/iter +/- 1.18
ascii::is_ascii_slice::unaligned_both_long::case04_while_loop_32      5.29/iter +/- 0.26
ascii::is_ascii_slice::unaligned_both_long::case05_while_loop_32_8    5.31/iter +/- 0.18
ascii::is_ascii_slice::unaligned_both_medium::case00_libcore          3.04/iter +/- 0.35
ascii::is_ascii_slice::unaligned_both_medium::case04_while_loop_32    2.78/iter +/- 0.08
ascii::is_ascii_slice::unaligned_both_medium::case05_while_loop_32_8  1.62/iter +/- 0.10
ascii::is_ascii_slice::unaligned_head_long::case00_libcore           23.03/iter +/- 2.51
ascii::is_ascii_slice::unaligned_head_long::case04_while_loop_32      8.04/iter +/- 0.24
ascii::is_ascii_slice::unaligned_head_long::case05_while_loop_32_8   13.04/iter +/- 0.34
ascii::is_ascii_slice::unaligned_head_medium::case00_libcore          3.78/iter +/- 0.52
ascii::is_ascii_slice::unaligned_head_medium::case04_while_loop_32    6.86/iter +/- 0.61
ascii::is_ascii_slice::unaligned_head_medium::case05_while_loop_32_8  8.39/iter +/- 0.16
ascii::is_ascii_slice::unaligned_tail_long::case00_libcore           24.26/iter +/- 0.22
ascii::is_ascii_slice::unaligned_tail_long::case04_while_loop_32      4.54/iter +/- 0.15
ascii::is_ascii_slice::unaligned_tail_long::case05_while_loop_32_8    4.69/iter +/- 0.05
ascii::is_ascii_slice::unaligned_tail_medium::case00_libcore          2.89/iter +/- 0.40
ascii::is_ascii_slice::unaligned_tail_medium::case04_while_loop_32    2.63/iter +/- 0.09
ascii::is_ascii_slice::unaligned_tail_medium::case05_while_loop_32_8  1.98/iter +/- 0.13

@the8472 (Member) commented Sep 23, 2024

The old code operates on usizes to at least get SWAR on targets without SIMD, even on those where unaligned loads are more expensive. So this needs to be checked on more than x86.
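
For context, the word-at-a-time idea works roughly like this; a minimal sketch, not the exact libcore code:

    // 0x80 repeated into every byte lane of a usize, i.e. one high bit per byte.
    const NONASCII_MASK: usize = usize::MAX / 255 * 0x80;

    // A byte is ASCII iff its high bit is clear, so a whole word is ASCII iff
    // no byte lane has its high bit set. This uses only plain integer ops
    // (SWAR), so it needs no SIMD support from the target.
    fn word_is_ascii(word: usize) -> bool {
        word & NONASCII_MASK == 0
    }

Checking one usize per iteration this way still gives a speedup on targets without usable SIMD, which is the concern raised above.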

@okaneco (Contributor, Author) commented Sep 27, 2024

> The old code operates on usizes to at least get SWAR on targets without SIMD, even on those where unaligned loads are more expensive. So this needs to be checked on more than x86.

I've reinstated the SWAR loop.

I know str::contains has an optimized path for SSE2.
Would it be okay to conditionally compile this is_ascii function for the same #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]?

I tried integrating this fast path into the current function and then using slice::from_raw_parts to handle the tail with SWAR, but it was 2-4x slower than the new standalone function. So I'm not sure it's worth making the SWAR function any more complicated than it is, unless there are other tricks that can be used.

fn is_ascii(mut s: &[u8]) -> bool {
    #[cfg(...)] // e.g. all(target_arch = "x86_64", target_feature = "sse2")
    {
        let mut i = 0;
        while i + 32 <= s.len() {
            // ... vectorized 32-byte check, advancing `i` ...
        }
        let new_len = s.len() - i;
        // SAFETY: `i <= s.len()`, so the tail stays within the original slice.
        s = unsafe { crate::slice::from_raw_parts(s.as_ptr().add(i), new_len) };
    }
    // proceed with SWAR
}

@the8472 (Member) commented Sep 28, 2024

> Would it be okay to conditionally compile this is_ascii function for the same #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]?

If the speedup requires platform-specific code or shaping the code to benefit x86 would negatively impact other platforms then that's acceptable, yes.
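
For reference, the gating being discussed would look roughly like this; the function bodies below are placeholders and the helper split is assumed, only the cfg shape matters:

    // The x86-64 + SSE2 fast path and the portable fallback are selected at
    // compile time; the real routines would replace the placeholder bodies.
    #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
    fn is_ascii(bytes: &[u8]) -> bool {
        // The auto-vectorized chunked check would live here.
        bytes.iter().all(|&b| b <= 127)
    }

    #[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
    fn is_ascii(bytes: &[u8]) -> bool {
        // The existing SWAR implementation would live here.
        bytes.iter().all(|&b| b <= 127)
    }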

@bors (Contributor) commented Nov 4, 2024

☔ The latest upstream changes (presumably #132586) made this pull request unmergeable. Please resolve the merge conflicts.

Add LONG benchmarks for more comparison between the methods
The new `is_ascii` function is optimized to use the `pmovmskb` vector instruction, which tests the high bit in each lane. This is the same check as whether a byte is ASCII, so ASCII validity checking can be vectorized. The instruction does not exist on other platforms, so the change would likely regress their performance, and the fast path is gated to all(target_arch = "x86_64", target_feature = "sse2").

Add codegen test
Remove crate::mem import for functions included in the prelude