Optimize VectorT.ConditionalSelect for constant masks #104092

Open · wants to merge 8 commits into main
Conversation


@ezhevita commented Jun 27, 2024

Resolves #104001.
I've also realized that it is not possible to implement this for arrays as I originally intended, since we have no immutability guarantees; instead, I've made sure the optimization applies to a ReadOnlySpan<T> property and a static VectorX<T> field (a field-based sketch follows the codegen listings below).

Example:

[DisassemblyDiagnoser]
public class TestClass
{
    private ReadOnlySpan<byte> Mask => [255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0];

    [Benchmark]
    public byte Span()
    {
        return Vector128.ConditionalSelect(Vector128.Create(Mask), Vector128<byte>.Zero, Vector128<byte>.One)[0];
    }
}

Old codegen:

; Method TestNamespace.TestClass:Span():ubyte:this (FullOpts)
G_M000_IG01:

G_M000_IG02:
       vmovups  xmm0, xmmword ptr [reloc @RWD00]
       vxorps   xmm1, xmm1, xmm1
       vpand    xmm1, xmm1, xmm0
       vpandn   xmm0, xmm0, xmmword ptr [reloc @RWD16]
       vpor     xmm0, xmm0, xmm1
       vmovd    eax, xmm0
       movzx    rax, al

G_M000_IG03:
       ret      
RWD00  	dq	00FF00FF00FF00FFh, 00FF00FF00FF00FFh
RWD16  	dq	0101010101010101h, 0101010101010101h
; Total bytes of code: 36

New codegen:

; Method TestNamespace.TestClass:Span():ubyte:this (FullOpts)
G_M000_IG01:

G_M000_IG02:
       vmovups  xmm0, xmmword ptr [reloc @RWD00]
       vxorps   xmm1, xmm1, xmm1
       vmovups  xmm2, xmmword ptr [reloc @RWD16]
       vpblendvb xmm0, xmm2, xmm1, xmm0
       vmovd    eax, xmm0
       movzx    rax, al

G_M000_IG03:
       ret      
RWD00  	dq	00FF00FF00FF00FFh, 00FF00FF00FF00FFh
RWD16  	dq	0101010101010101h, 0101010101010101h
; Total bytes of code: 34
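
For the static-field shape mentioned above, a variant along these lines should be recognized the same way. This is a minimal sketch (class and member names are mine, not from the PR's benchmarks), assuming the JIT can treat a static readonly vector field as a constant in optimized code:

public class TestClassField
{
    // Hypothetical static mask field; in optimized (tier-1) code the JIT can
    // treat static readonly fields as constants, so this folds to GT_CNS_VEC.
    private static readonly Vector128<byte> MaskField = Vector128.Create(
        (byte)255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0);

    [Benchmark]
    public byte Field()
    {
        return Vector128.ConditionalSelect(MaskField, Vector128<byte>.Zero, Vector128<byte>.One)[0];
    }
}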

This adds a check in the JIT for constant masks (`GT_CNS_VEC`; other constant vector creations get lowered to it) and enables the optimization to `BlendVariable` (the `(v)pblendvb` instruction).
It currently does not work for masks loaded from an array stored in a field or variable.
This optimization is also not triggered on platforms supporting AVX512F(/VL?), since ConditionalSelect gets optimized earlier to the `vpternlogd` instruction there.
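
For reference, the lowering amounts to what one could write by hand with the SSE4.1 intrinsic. A sketch of the equivalence in C# (my example, not the JIT code itself):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector128<byte> SelectViaBlend(Vector128<byte> mask, Vector128<byte> left, Vector128<byte> right)
{
    // ConditionalSelect(mask, left, right) computes (left & mask) | (right & ~mask).
    // When every mask byte is all-ones or all-zero, a single (v)pblendvb gives the
    // same answer: BlendVariable picks its second operand where the mask byte's high
    // bit is set, so the operands are swapped relative to ConditionalSelect.
    if (Sse41.IsSupported)
    {
        return Sse41.BlendVariable(right, left, mask);
    }

    return Vector128.ConditionalSelect(mask, left, right);
}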
The dotnet-issue-labeler bot added the area-CodeGen-coreclr label on Jun 27, 2024.
The dotnet-policy-service bot added the community-contribution label on Jun 27, 2024.
Contributor:
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@ezhevita (Author):
@dotnet-policy-service agree

if (IsCnsVec())
{
    const GenTreeVecCon* vecCon = AsVecCon();
    if (vecCon->IsAllBitsSet() || vecCon->IsZero())
Member:
Do we need these checks? I doubt they save any measurable throughput (TP).

Member:
👍, these functions just loop over the memory checking whether every element is a given value.

Author:
@tannergooding suggested that these checks should be in place here: https://discord.com/channels/143867839282020352/312132327348240384/1255568916151930990
However, ConditionalSelect with an all-bits-set or all-zero mask is pretty useless, so we can get rid of these checks.

Member:
The consideration is rather that the main loop below already handles this, and these checks aren't "cheap": they each do a loop and comparison themselves.

So by removing these explicit checks, we can loop once rather than three times and still handle all the scenarios.

// Returns:
// True if this node is a vector constant per-element mask
//
bool GenTree::IsVectorPerElementMask(var_types simdBaseType, unsigned simdSize) const
Member:
Perhaps we can name it ElementsAreAllBitsSetOrZero instead?

Member:
We've explicitly used the term PerElementMask elsewhere throughout the JIT to encompass the general concept necessary here.

@@ -29442,6 +29442,57 @@ bool GenTree::IsInvariant() const
    return OperIsConst() || OperIs(GT_LCL_ADDR) || OperIs(GT_FTN_ADDR);
}

//-------------------------------------------------------------------
// IsVectorPerElementMask: returns true if this node is a vector constant per-element mask
// (every element has either all bits set or none of them).
Member:
nit: comments for parameters are missing

Comment on lines +29487 to +29488
// TODO-XARCH-AVX512 Use VPBLENDM* and take input directly from K registers if cond is from
// MoveMaskToVectorSpecial.
Member:
This comment isn't applicable to the general query; it was specific to the CndSel lowering logic.

{
    // TODO-XARCH-AVX512 Use VPBLENDM* and take input directly from K registers if cond is from
    // MoveMaskToVectorSpecial.
    return HWIntrinsicInfo::ReturnsPerElementMask(AsHWIntrinsic()->GetHWIntrinsicId());
Member:
If you wanted to, you could also update this just a little bit more, as there are some very simple cases we could cover that would allow us to optimize further.

In particular, you could have this do:

GenTreeHWIntrinsic* intrinsic   = AsHWIntrinsic();
NamedIntrinsic      intrinsicId = intrinsic->GetHWIntrinsicId();

if (HWIntrinsicInfo::ReturnsPerElementMask(intrinsicId))
{
    // We directly return a per-element mask
    return true;
}

bool       isScalar = false;
genTreeOps oper     = intrinsic->HWOperGet(&isScalar);

switch (oper)
{
    case GT_AND:
    case GT_AND_NOT:
    case GT_OR:
    case GT_XOR:
    {
        // We are a binary bitwise operation where both inputs are per-element masks
        return intrinsic->Op(1)->IsVectorPerElementMask(simdBaseType, simdSize)
            && intrinsic->Op(2)->IsVectorPerElementMask(simdBaseType, simdSize);
    }

    case GT_NOT:
    {
        // We are a unary bitwise operation where the input is a per-element mask
        return intrinsic->Op(1)->IsVectorPerElementMask(simdBaseType, simdSize);
    }

    default:
    {
        assert(!GenTreeHWIntrinsic::OperIsBitwiseHWIntrinsic(oper));
        break;
    }
}

return false;

There are others that qualify as well, but I expect they're rarer, and the bitwise operations are the most important to consider.
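
As a user-level illustration of what that recursion buys (my own example, not from the PR), masks built by combining comparison results through bitwise operators would also be recognized as per-element masks:

using System.Runtime.Intrinsics;

static Vector128<int> ZeroOutsideRange(Vector128<int> value, Vector128<int> lo, Vector128<int> hi)
{
    // Each comparison produces a per-element mask (all-ones or all-zero per lane);
    // the & of two such masks is still a per-element mask, which the suggested
    // GT_AND case would prove, enabling the blend optimization.
    Vector128<int> mask = Vector128.GreaterThanOrEqual(value, lo)
                        & Vector128.LessThanOrEqual(value, hi);

    return Vector128.ConditionalSelect(mask, value, Vector128<int>.Zero);
}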

Successfully merging this pull request may close these issues.

VectorT.ConditionalSelect doesn’t get optimized for const masks on non-AVX512 platforms