Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: CMOV aka CMOVcc not generated when method is inlined. #96593

Closed
hexawyz opened this issue Jan 7, 2024 · 3 comments
Closed

JIT: CMOV aka CMOVcc not generated when method is inlined. #96593

hexawyz opened this issue Jan 7, 2024 · 3 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue

Comments

@hexawyz
Copy link

hexawyz commented Jan 7, 2024

Description

When playing with the now infamous "One Billion Row Challenge", I noticed that while the JIT properly emits CMOVG and CMOVL for Min/Max operations, those sometimes get replaced by more classical branches when in the context of a more complex method.
I've put the relevant code here to look at the disassembly: Code on SharpLab

To confirm that what I observed in SharpLab corresponded to what would be a Tier1 codegen, I manually observed a Release build in Visual Studio. I could see the assembly for the method changing between two occurences of a manual breakpoint (using Debugger.Break())

As far as I can tell, the code generation difference is not directly related to JIT inlining, as I would see the conditional jump opcodes before splitting this specific code in a separate method.

Relevant C# Code for the method:

struct ItemData
{
	private int _min;
	private int _max;
	private int _sum;
	private int _count;

	public void Aggregate(int value)
	{
		_min = Math.Min(_min, value);
		_max = Math.Max(_max, value);
		_sum += value;
		_count++;
	}
}

Code of the method before inlining:

ItemData.Aggregate(Int32)
    L0000: mov eax, [rcx]
    L0002: cmp eax, edx
    L0004: cmovg eax, edx
    L0007: mov [rcx], eax
    L0009: mov eax, [rcx+4]
    L000c: cmp eax, edx
    L000e: cmovl eax, edx
    L0011: mov [rcx+4], eax
    L0014: add [rcx+8], edx
    L0017: inc dword ptr [rcx+0xc]
    L001a: ret

Relevant code of the method when inlined:

Parser.Parse(System.Collections.Generic.Dictionary`2<Utf8StringRef,ItemData>, Byte ByRef, Byte ByRef)
…
    L00cc: mov ecx, [r13]
    L00d0: cmp ecx, eax
    L00d2: jle short L00e6 ; Corresponding to cmovg above
    L00d4: mov edx, eax
    L00d6: mov [r13], edx
    L00da: mov ecx, [r13+4]
    L00de: cmp ecx, eax
    L00e0: jge short L010b  ; Corresponding to cmovl above
    L00e2: mov edx, eax
    L00e4: jmp short L010d
    L00e6: mov edx, ecx
    L00e8: jmp short L00d6
…
    L010b: mov edx, ecx
    L010d: mov [r13+4], edx
    L0111: add [r13+8], eax
    L0115: inc dword ptr [r13+0xc]
    L0119: jmp short L0100
    L011b: cmp rdi, rsi
…

I initially thought that the problem with CMOVcc was that it can not have a memory destination, but as can be seen in the non inlined method, the JIT chooses to use it despite that fact, which matches with my intuition that it is the correct thing to do here.
The jumps that I see re-introduced in the inlined version (possibly because the JIT failed to detect the pattern there ?) are at least increasing the size of the assembly, but I'd assume they would possibly degrade the performance of the code, depending on the exact data processed.

I tried to trick the JIT in generating CMOV-based code in multiple different ways there (introducing intermediate variables, etc), but I could not get it to not do it.
IIRC, the worst of the cases I managed to produce was one where the conditional jumps would directly point to JMP opcodes without doing anything else uselful.

Code can also be found here: https://github.com/hexawyz/OneBillionRows/blob/adf45647de4d4d9d49c6002e29bd483cee608af3/OneBillionRows/Program.cs#L328

I apologize for not providing a more minimal repro, but hopefully, this one is still small enough so that you can understand what is happening here.

Configuration

Local: .NET 8.0.100, x64
SharpLab: Core CLR 8.0.23.53103 on x64

@hexawyz hexawyz added the tenet-performance Performance related issue label Jan 7, 2024
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jan 7, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 7, 2024
@ghost
Copy link

ghost commented Jan 7, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

When playing with the now infamous "One Billion Row Challenge", I noticed that while the JIT properly emits CMOVG and CMOVL for Min/Max operations, those sometimes get replaced by more classical branches when in the context of a more complex method.
I've put the relevant code here to look at the disassembly: Code on SharpLab

To confirm that what I observed in SharpLab corresponded to what would be a Tier1 codegen, I manually observed a Release build in Visual Studio. I could see the assembly for the method changing between two occurences of a manual breakpoint (using Debugger.Break())

As far as I can tell, the code generation difference is not directly related to JIT inlining, as I would see the conditional jump opcodes before splitting this specific code in a separate method.

Relevant C# Code for the method:

struct ItemData
{
	private int _min;
	private int _max;
	private int _sum;
	private int _count;

	public void Aggregate(int value)
	{
		_min = Math.Min(_min, value);
		_max = Math.Max(_max, value);
		_sum += value;
		_count++;
	}
}

Code of the method before inlining:

ItemData.Aggregate(Int32)
    L0000: mov eax, [rcx]
    L0002: cmp eax, edx
    L0004: cmovg eax, edx
    L0007: mov [rcx], eax
    L0009: mov eax, [rcx+4]
    L000c: cmp eax, edx
    L000e: cmovl eax, edx
    L0011: mov [rcx+4], eax
    L0014: add [rcx+8], edx
    L0017: inc dword ptr [rcx+0xc]
    L001a: ret

Relevant code of the method when inlined:

Parser.Parse(System.Collections.Generic.Dictionary`2<Utf8StringRef,ItemData>, Byte ByRef, Byte ByRef)
…
    L00cc: mov ecx, [r13]
    L00d0: cmp ecx, eax
    L00d2: jle short L00e6 ; Corresponding to cmovg above
    L00d4: mov edx, eax
    L00d6: mov [r13], edx
    L00da: mov ecx, [r13+4]
    L00de: cmp ecx, eax
    L00e0: jge short L010b  ; Corresponding to cmovl above
    L00e2: mov edx, eax
    L00e4: jmp short L010d
    L00e6: mov edx, ecx
    L00e8: jmp short L00d6
…
    L010b: mov edx, ecx
    L010d: mov [r13+4], edx
    L0111: add [r13+8], eax
    L0115: inc dword ptr [r13+0xc]
    L0119: jmp short L0100
    L011b: cmp rdi, rsi
…

I initially thought that the problem with CMOVcc was that it can not have a memory destination, but as can be seen in the non inlined method, the JIT chooses to use it despite that fact, which matches with my intuition that it is the correct thing to do here.
The jumps that I see re-introduced in the inlined version (possibly because the JIT failed to detect the pattern there ?) are at least increasing the size of the assembly, but I'd assume they would possibly degrade the performance of the code, depending on the exact data processed.

I tried to trick the JIT in generating CMOV-based code in multiple different ways there (introducing intermediate variables, etc), but I could not get it to not do it.
IIRC, the worst of the cases I managed to produce was one where the conditional jumps would directly point to JMP opcodes without doing anything else uselful.

Code can also be found here: https://github.com/hexawyz/OneBillionRows/blob/adf45647de4d4d9d49c6002e29bd483cee608af3/OneBillionRows/Program.cs#L328

I apologize for not providing a more minimal repro, but hopefully, this one is still small enough so that you can understand what is happening here.

Configuration

Local: .NET 8.0.100
SharpLab: Core CLR 8.0.23.53103 on x64

Author: hexawyz
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

@neon-sunset
Copy link
Contributor

It's probably this #96380

@jakobbotsch
Copy link
Member

Thanks for the report. Indeed, this is a duplicate of #96380. We do not emit conditional instructions inside loops today because it can cause significant performance regressions, and the right analysis to predict the profitability has not yet been implemented.
I'm going to close this, but we will keep the example you have shared in mind for when we get around to working on this.

@jakobbotsch jakobbotsch closed this as not planned Won't fix, can't repro, duplicate, stale Jan 7, 2024
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jan 7, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Feb 7, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

3 participants