CINC/CSEL not emitted inside loops; jumps over single-instruction blocks emitted instead #96380
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Description

It appears that for a straightforward compare -> increment, .NET emits a branch instead of cinc when a method containing such code is inlined inside a loop body. When the method is not inlined, it emits csel.

Analysis

Given a method RuneLength:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int RuneLength(in byte value)
{
var lzcnt = (uint)BitOperations.LeadingZeroCount(~((uint)value << 24));
if (lzcnt is 0) lzcnt++;
return (int)lzcnt;
}

This compiles to:

G_M000_IG01: ;; offset=0x0000
stp fp, lr, [sp, #-0x10]!
mov fp, sp
G_M000_IG02: ;; offset=0x0008
ldrb w0, [x0]
lsl w0, w0, #24
mvn w0, w0
clz w0, w0
mov w1, #1
cmp w0, #0
csel w0, w0, w1, ne ;; <-- Could have been just cinc with the mov #1 elided - we're incrementing zero
G_M000_IG03: ;; offset=0x0024
ldp fp, lr, [sp], #0x10
ret lr
; Total bytes of code 44

We can see that .NET emits mov + cmp + csel here. This isn't cinc, but it's a good start.

However, if the method is inlined inside a loop body, the codegen is different. Consider the below method:

static int Iterate(ref byte ptr, ref byte end)
{
var acc = 0;
while (Unsafe.IsAddressLessThan(ref ptr, ref end))
{
var length = RuneLength(in ptr);
acc += length;
ptr = ref Unsafe.Add(ref ptr, length);
}
return acc;
}

Instead of the pattern above, the codegen changes to cbnz label and mov 1, which is more compact but more expensive, because it uses branch execution units, which have lower per-cycle throughput than csel, which uses integer units instead (on modern ARM cores like Firestorm or Cortex-X1):

G_M000_IG01: ;; offset=0x0000
stp fp, lr, [sp, #-0x10]!
mov fp, sp
G_M000_IG02: ;; offset=0x0008
mov w2, wzr
cmp x0, x1
bhs G_M000_IG06
align [0 bytes for IG03]
align [0 bytes]
align [0 bytes]
align [0 bytes]
G_M000_IG03: ;; offset=0x0014
ldrb w3, [x0]
lsl w3, w3, #24
mvn w3, w3
clz w3, w3
cbnz w3, G_M000_IG05 ;; <-- Could have been cmp + cinc
G_M000_IG04: ;; offset=0x0028
mov w3, #1
G_M000_IG05: ;; offset=0x002C
add w2, w2, w3
sxtw x3, w3
add x0, x0, x3
cmp x0, x1
blo G_M000_IG03
G_M000_IG06: ;; offset=0x0040
mov w0, w2
G_M000_IG07: ;; offset=0x0044
ldp fp, lr, [sp], #0x10
ret lr
; Total bytes of code 76

Configuration

Regression? No

Thanks!
We do not emit conditional select instructions inside loops today because we do not have the necessary framework in place to make a proper prediction of whether that is profitable or not. Branches can (and do, in real-world examples) outperform conditional select instructions thanks to branch prediction. See #55364 (comment) for more details.
This might be true for older ARM cores but not necessarily for the new ones. For example, the software optimization guides for Firestorm (2020) and newer cores such as Cortex-X1 describe a throughput of 4 conditional selects/compares per cycle vs. 2 branch instructions. Reading through the linked comment, the advice given by ARM may not necessarily apply here (or perhaps something was lost in communication?): given similar codegen size, branches are likely to be more profitable only under certain conditions. Therefore, there may be a performance win to be had from a simple heuristic that expands the instances where csinc/cinc/csel/etc. kick in.
Another important consideration is the dependency chain the conditional instruction is involved in and whether or not it is on the critical path of the loop. We do not currently have something in place to evaluate that.
Yes, performance will benefit in many cases from using conditional instructions within loops. I think getting the heuristic right is not going to be simple, in particular around the dependency chain and around ensuring the quality of the PGO data we want to base the decisions on. And if we get the decision wrong, it can have a very significant negative impact on performance (#82106 was a 40% regression in a core span function from a single misplaced if-conversion). Note that @a74nh works for Arm and originally implemented if-conversion within RyuJIT. He also did some experiments in dotnet/performance#3078.
That is true, thanks. In this issue there's also a note regarding not emitting cinc in the non-inlined case.
Feel free to open a separate issue for it. Hopefully I can take a look soon. We should be able to emit cinc for:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int RuneLength(in byte value)
{
int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
return lzcnt == 0 ? lzcnt + 1 : lzcnt;
}

yet this (incorrect) rewrite does:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int RuneLength(in byte value)
{
int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
return lzcnt == 0 ? lzcnt : lzcnt + 1;
}

I think it should be a simple fix.
Changed the title from "CINC or at least CSEL not emitted for conditional increment" to "CINC/CSEL not emitted inside the loops instead of jumps over single instruction blocks".
@jakobbotsch I have updated the issue to reflect the possible improvement in the loop case vs. the separate one for the cinc/csel form. Thank you for looking into this!
Won't get to this one in 9.0.
Another example (CSEL): https://godbolt.org/z/nv5WEv5Tv (yes, it's not an optimal implementation, but code like this, pointers notwithstanding, is pretty common). It would be nice if this loop restriction could be revisited eventually, at least for trivial "up to n if-converted nodes" cases, especially since Intel is also contributing its own APX variants that will also benefit from this. And if this has a substantial enough impact, it's going to be seen on the new Cobalt-based runners (unless I'm misremembering?). Thanks.