
CINC/CSEL not emitted inside the loops instead of jumps over single instruction blocks #96380

Open
neon-sunset opened this issue Jan 1, 2024 · 9 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@neon-sunset
Contributor

neon-sunset commented Jan 1, 2024

Description

It appears that for a straightforward compare -> increment, .NET emits a branch instead of cinc when a method containing such code is inlined inside a loop body.

Analysis

Given a method RuneLength:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int RuneLength(in byte value)
{
    var lzcnt = (uint)BitOperations.LeadingZeroCount(~((uint)value << 24));
    if (lzcnt is 0) lzcnt++;

    return (int)lzcnt;
}

This compiles to:

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0x0008
            ldrb    w0, [x0]
            lsl     w0, w0, #24
            mvn     w0, w0
            clz     w0, w0
            mov     w1, #1
            cmp     w0, #0
            csel    w0, w0, w1, ne ;; <-- Could have been just cinc with mov 1 elided - we're incrementing zero
 
G_M000_IG03:                ;; offset=0x0024
            ldp     fp, lr, [sp], #0x10
            ret     lr
 
; Total bytes of code 44

We can see that .NET emits mov + cmp + csel here. This isn't cinc but it's a good start.

However, if the method is inlined inside a loop body, the codegen is different.
Consider the below method:

static int Iterate(ref byte ptr, ref byte end)
{
    var acc = 0;
    while (Unsafe.IsAddressLessThan(ref ptr, ref end))
    {
        var length = RuneLength(in ptr);
        acc += length;
        ptr = ref Unsafe.Add(ref ptr, length);
    }

    return acc;
}

Instead of the pattern above, the codegen changes to cbnz label plus mov 1. This is more compact, but also more expensive: it uses branch execution units, which have lower per-cycle throughput than the integer units that execute csel (on modern ARM cores like Firestorm or Cortex-X1):

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0x0008
            mov     w2, wzr
            cmp     x0, x1
            bhs     G_M000_IG06
            align   [0 bytes for IG03]
            align   [0 bytes]
            align   [0 bytes]
            align   [0 bytes]
 
G_M000_IG03:                ;; offset=0x0014
            ldrb    w3, [x0]
            lsl     w3, w3, #24
            mvn     w3, w3
            clz     w3, w3
            cbnz    w3, G_M000_IG05 ;; <-- Could have been cmp + cinc
 
G_M000_IG04:                ;; offset=0x0028
            mov     w3, #1
 
G_M000_IG05:                ;; offset=0x002C
            add     w2, w2, w3
            sxtw    x3, w3
            add     x0, x0, x3
            cmp     x0, x1
            blo     G_M000_IG03
 
G_M000_IG06:                ;; offset=0x0040
            mov     w0, w2
 
G_M000_IG07:                ;; offset=0x0044
            ldp     fp, lr, [sp], #0x10
            ret     lr
 
; Total bytes of code 76

Configuration

.NET SDK:
 Version:           8.0.100
 Commit:            57efcf1350
 Workload version:  8.0.100-manifests.71b9f198

Runtime Environment:
 OS Name:     Mac OS X
 OS Version:  14.1
 OS Platform: Darwin
 RID:         osx-arm64
 Base Path:   /usr/local/share/dotnet/sdk/8.0.100/

Regression?

No

Happy New Year!

@neon-sunset neon-sunset added the tenet-performance Performance related issue label Jan 1, 2024
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jan 1, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 1, 2024
@ghost

ghost commented Jan 1, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch
Member

We do not emit conditional select instructions inside loops today because we do not have the necessary framework in place to make a proper prediction on whether that is profitable or not. Branches can (and do in real world examples) outperform conditional select instructions due to branch prediction. See #55364 (comment) for more details.

@neon-sunset
Contributor Author

neon-sunset commented Jan 1, 2024

This might be true for older ARM cores but not necessarily for the new ones. For example, Firestorm (2020) and newer cores on osx-arm64 have a reciprocal throughput for cinc of 0.333, better than branches, which are at 0.5 when not taken and 1 when taken. This means that in this instance it should in theory be unconditionally cheaper (at equal codegen size).
https://dougallj.github.io/applecpu/measurements/firestorm/CINC_64.html

The Cortex-X1 software optimization guide likewise quotes a throughput of 4 conditional selects/compares per cycle versus 2 branch instructions.
https://documentation-service.arm.com/static/6238b0f88804d00769e9dee6

Reading through the linked comment, the advice given by ARM may not necessarily apply (or perhaps something was lost in communication?) - given similar codegen size, branches are likely to be more profitable under the following conditions:

  • Blocks under conditionals span multiple (not trivially cheap) instructions
  • Loop body has only single taken or two not taken branches
  • Condition rarely or never changes, or changes in a predictable pattern, so that misprediction rate is minimal

Therefore, there may be a performance win to be had by having a simple heuristic that expands the instances where csinc/cinc/csel/etc. kick in.

@jakobbotsch
Member

Reading through the linked comment, the advice given by ARM may not necessarily apply (or perhaps something was lost in communication?) - given similar codegen size, branches are likely to be more profitable under the following conditions:

  • Blocks under conditionals span multiple (not trivially cheap) instructions
  • Loop body has only single taken or two not taken branches
  • Condition rarely or never changes, or changes in a predictable pattern, so that misprediction rate is minimal

Another important consideration is the dependency chain the conditional instruction is involved in and whether or not it is on the critical path of the loop. We do not currently have something in place to evaluate that.

Therefore, there may be a performance win to be had by having a simple heuristic that expands the instances where csinc/cinc/csel/etc. kick in.

Yes, performance will benefit in many cases from using conditional instructions within loops. I think getting the heuristic right is not going to be simple -- in particular around the dependency chain and around ensuring the quality of the PGO data we want to base the decisions on. And if we get the decision wrong, it can have very significant negative impact on performance (#82106 was a 40% regression in a core span function from a single misplaced if-conversion).

Note that @a74nh works for Arm and originally implemented if-conversion within RyuJIT. He also did some experiments in dotnet/performance#3078.

@neon-sunset
Contributor Author

Yes, performance will benefit in many cases from using conditional instructions within loops. I think getting the heuristic right is not going to be simple -- in particular around the dependency chain and around ensuring the quality of the PGO data we want to base the decisions on. And if we get the decision wrong, it can have very significant negative impact on performance (#82106 was a 40% regression in a core span function from a single misplaced if-conversion).

That is true, thanks. In this issue there's also a note about emitting cmp + cinc instead of mov + cmp + csel. Do you want me to submit a separate issue or does this one suffice?

@jakobbotsch
Member

jakobbotsch commented Jan 1, 2024

Feel free to open a separate issue for it. Hopefully I can take a look soon. We should be able to emit csinc since #91262, so I'm not sure what goes wrong here. Even the following does not produce csinc:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int RuneLength(in byte value)
{
    int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
    return lzcnt == 0 ? lzcnt + 1 : lzcnt;
}

yet this (incorrect) rewrite does:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int RuneLength(in byte value)
{
    int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
    return lzcnt == 0 ? lzcnt : lzcnt + 1;
}

I think it should be a simple fix.

@neon-sunset neon-sunset changed the title [ARM64] CINC or at least CSEL not emitted for conditional increment [ARM64] CINC/CSEL not emitted inside the loops instead of jumps over single instruction blocks Jan 3, 2024
@neon-sunset
Contributor Author

@jakobbotsch I have updated the issue to reflect the possible improvement in the loop vs separate one for the cinc/csel form. Thank you for looking into this!

@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Jan 4, 2024
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jan 4, 2024
@neon-sunset neon-sunset changed the title [ARM64] CINC/CSEL not emitted inside the loops instead of jumps over single instruction blocks CINC/CSEL not emitted inside the loops instead of jumps over single instruction blocks Jan 7, 2024
@jakobbotsch
Member

Won't get to this one in 9.0.

@neon-sunset
Contributor Author

neon-sunset commented Jan 9, 2025

Another example (CSEL): https://godbolt.org/z/nv5WEv5Tv (yes, it's not an optimal implementation but code like this, pointers notwithstanding, is pretty common)

It would be nice if this loop restriction could eventually be revisited, at least for trivial "up to n if-converted nodes" cases, especially since Intel is also contributing its own APX variants that would benefit from this as well.

And if this has substantial enough impact, it's going to be seen on the new Cobalt-based runners (unless I'm misremembering?).

Thanks.
