
CINC/CSEL not emitted inside the loops instead of jumps over single instruction blocks #96380

Open
neon-sunset opened this issue Jan 1, 2024 · 9 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@neon-sunset
Contributor

neon-sunset commented Jan 1, 2024

Description

It appears that for a straightforward compare -> increment, .NET emits a branch instead of cinc when a method containing such code is inlined inside a loop body.

Analysis

Given a method RuneLength:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int RuneLength(in byte value)
{
    var lzcnt = (uint)BitOperations.LeadingZeroCount(~((uint)value << 24));
    if (lzcnt is 0) lzcnt++;

    return (int)lzcnt;
}

This compiles to:

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0x0008
            ldrb    w0, [x0]
            lsl     w0, w0, #24
            mvn     w0, w0
            clz     w0, w0
            mov     w1, #1
            cmp     w0, #0
            csel    w0, w0, w1, ne ;; <-- Could have been just cinc with mov 1 elided - we're incrementing zero
 
G_M000_IG03:                ;; offset=0x0024
            ldp     fp, lr, [sp], #0x10
            ret     lr
 
; Total bytes of code 44

We can see that .NET emits mov + cmp + csel here. This isn't cinc but it's a good start.

However, if the method is inlined inside a loop body, the codegen is different.
Consider the below method:

static int Iterate(ref byte ptr, ref byte end)
{
    var acc = 0;
    while (Unsafe.IsAddressLessThan(ref ptr, ref end))
    {
        var length = RuneLength(in ptr);
        acc += length;
        ptr = ref Unsafe.Add(ref ptr, length);
    }

    return acc;
}

Instead of the pattern above, the codegen changes to cbnz label plus mov 1. This is more compact, but also more expensive: it uses branch execution units, which have lower per-cycle throughput than the integer units that execute csel (on modern ARM cores like Firestorm or Cortex-X1):

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0x0008
            mov     w2, wzr
            cmp     x0, x1
            bhs     G_M000_IG06
            align   [0 bytes for IG03]
            align   [0 bytes]
            align   [0 bytes]
            align   [0 bytes]
 
G_M000_IG03:                ;; offset=0x0014
            ldrb    w3, [x0]
            lsl     w3, w3, #24
            mvn     w3, w3
            clz     w3, w3
            cbnz    w3, G_M000_IG05 ;; <-- Could have been cmp + cinc
 
G_M000_IG04:                ;; offset=0x0028
            mov     w3, #1
 
G_M000_IG05:                ;; offset=0x002C
            add     w2, w2, w3
            sxtw    x3, w3
            add     x0, x0, x3
            cmp     x0, x1
            blo     G_M000_IG03
 
G_M000_IG06:                ;; offset=0x0040
            mov     w0, w2
 
G_M000_IG07:                ;; offset=0x0044
            ldp     fp, lr, [sp], #0x10
            ret     lr
 
; Total bytes of code 76

Configuration

.NET SDK:
 Version:           8.0.100
 Commit:            57efcf1350
 Workload version:  8.0.100-manifests.71b9f198

Runtime Environment:
 OS Name:     Mac OS X
 OS Version:  14.1
 OS Platform: Darwin
 RID:         osx-arm64
 Base Path:   /usr/local/share/dotnet/sdk/8.0.100/

Regression?

No

Happy New Year!

@neon-sunset neon-sunset added the tenet-performance Performance related issue label Jan 1, 2024
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jan 1, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 1, 2024
@ghost

ghost commented Jan 1, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch
Member

We do not emit conditional select instructions inside loops today because we do not have the necessary framework in place to make a proper prediction on whether that is profitable or not. Branches can (and do in real world examples) outperform conditional select instructions due to branch prediction. See #55364 (comment) for more details.

@neon-sunset
Contributor Author

neon-sunset commented Jan 1, 2024

This might be true for older ARM cores but not necessarily for the new ones. For example, Firestorm (2020) and newer cores on osx-arm64 have a reciprocal throughput for cinc of 0.333, better than branches, which are at 0.5 when not taken and 1 when taken. This means that in this instance it should in theory be unconditionally cheaper (at equal codegen size).
https://dougallj.github.io/applecpu/measurements/firestorm/CINC_64.html

The Cortex-X1 software optimization guide likewise quotes a throughput of 4 conditional selects/compares per cycle versus 2 branch instructions.
https://documentation-service.arm.com/static/6238b0f88804d00769e9dee6

Reading through the linked comment, the advice given by ARM may not necessarily apply (or perhaps something was lost in communication?) - given similar codegen size, branches are likely to be more profitable under the following conditions:

  • Blocks under conditionals span multiple (not trivially cheap) instructions
  • Loop body has only single taken or two not taken branches
  • Condition rarely or never changes, or changes in a predictable pattern, so that misprediction rate is minimal

Therefore, there may be a performance win to be had by having a simple heuristic that expands the instances where csinc/cinc/csel/etc. kick in.

@jakobbotsch
Member

Reading through the linked comment, the advice given by ARM may not necessarily apply (or perhaps something was lost in communication?) - given similar codegen size, branches are likely to be more profitable under the following conditions:

  • Blocks under conditionals span multiple (not trivially cheap) instructions
  • Loop body has only single taken or two not taken branches
  • Condition rarely or never changes, or changes in a predictable pattern, so that misprediction rate is minimal

Another important consideration is the dependency chain the conditional instruction is involved in and whether or not it is on the critical path of the loop. We do not currently have something in place to evaluate that.

Therefore, there may be a performance win to be had by having a simple heuristic that expands the instances where csinc/cinc/csel/etc. kick in.

Yes, performance will benefit in many cases from using conditional instructions within loops. I think getting the heuristic right is not going to be simple -- in particular around the dependency chain and around ensuring the quality of the PGO data we want to base the decisions on. And if we get the decision wrong, it can have very significant negative impact on performance (#82106 was a 40% regression in a core span function from a single misplaced if-conversion).

Note that @a74nh works for Arm and originally implemented if-conversion within RyuJIT. He also did some experiments in dotnet/performance#3078.

@neon-sunset
Contributor Author

Yes, performance will benefit in many cases from using conditional instructions within loops. I think getting the heuristic right is not going to be simple -- in particular around the dependency chain and around ensuring the quality of the PGO data we want to base the decisions on. And if we get the decision wrong, it can have very significant negative impact on performance (#82106 was a 40% regression in a core span function from a single misplaced if-conversion).

That is true, thanks. In this issue there's also a note about emitting cmp + cinc instead of mov + cmp + csel. Do you want me to submit a separate issue or does this one suffice?

@jakobbotsch
Member

jakobbotsch commented Jan 1, 2024

Feel free to open a separate issue for it. Hopefully I can take a look soon. We should be able to emit csinc since #91262, so I'm not sure what goes wrong here. Even the following does not produce csinc:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int RuneLength(in byte value)
{
    int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
    return lzcnt == 0 ? lzcnt + 1 : lzcnt;
}

yet this (incorrect) rewrite does:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int RuneLength(in byte value)
{
    int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
    return lzcnt == 0 ? lzcnt : lzcnt + 1;
}

I think it should be a simple fix.

@neon-sunset neon-sunset changed the title [ARM64] CINC or at least CSEL not emitted for conditional increment [ARM64] CINC/CSEL not emitted inside the loops instead of jumps over single instruction blocks Jan 3, 2024
@neon-sunset
Contributor Author

@jakobbotsch I have updated the issue to reflect the possible improvement in the loop vs separate one for the cinc/csel form. Thank you for looking into this!

@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Jan 4, 2024
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jan 4, 2024
@neon-sunset neon-sunset changed the title [ARM64] CINC/CSEL not emitted inside the loops instead of jumps over single instruction blocks CINC/CSEL not emitted inside the loops instead of jumps over single instruction blocks Jan 7, 2024
@jakobbotsch
Member

Won't get to this one in 9.0.

@neon-sunset
Contributor Author

neon-sunset commented Jan 9, 2025

Another example (CSEL): https://godbolt.org/z/nv5WEv5Tv (yes, it's not an optimal implementation but code like this, pointers notwithstanding, is pretty common)

It would be nice if this loop restriction could eventually be revisited, at least for trivial "up to n if-converted nodes" cases, especially since Intel is also contributing its own APX variants that would benefit from this as well.

And if this has substantial enough impact, it's going to be seen on the new Cobalt-based runners (unless I'm misremembering?).

Thanks.
