-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RISCV][VLOPT] Add support for checkUsers when UserMI is a Single-Width Integer Reduction #120345
Conversation
@llvm/pr-subscribers-backend-risc-v Author: Michael Maitland (michaelmaitland) ChangesReductions are weird because for some operands, they are vector registers but only read the first lane. For these operands, we do not need to check to make sure the EEW and EMUL ratios match. However, we need to make sure that when the reduction instruction has a non-zero VL operand, we don't try and set the CommonVL=0. Full diff: https://github.com/llvm/llvm-project/pull/120345.diff 3 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
index a9e5bb6ecd9b8a..fb020e361b4cee 100644
--- a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
+++ b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
@@ -568,6 +568,23 @@ static OperandInfo getOperandInfo(const MachineOperand &MO,
return OperandInfo(MIVLMul, MILog2SEW);
}
+ // Vector Reduction Operations
+ // Vector Single-Width Integer Reduction Instructions
+ // The Dest and VS1 only read element 0 of the vector register. Return unknown
+ // for these. VS2 has EEW=SEW and EMUL=LMUL.
+ case RISCV::VREDAND_VS:
+ case RISCV::VREDMAX_VS:
+ case RISCV::VREDMAXU_VS:
+ case RISCV::VREDMIN_VS:
+ case RISCV::VREDMINU_VS:
+ case RISCV::VREDOR_VS:
+ case RISCV::VREDSUM_VS:
+ case RISCV::VREDXOR_VS: {
+ if (MO.getOperandNo() == 2)
+ return OperandInfo(MIVLMul, MILog2SEW);
+ return {};
+ }
+
default:
return {};
}
@@ -886,11 +903,28 @@ bool RISCVVLOptimizer::checkUsers(const MachineOperand *&CommonVL,
// Instructions like reductions may use a vector register as a scalar
// register. In this case, we should treat it like a scalar register which
- // does not impact the decision on whether to optimize VL.
- // TODO: Treat it like a scalar register instead of bailing out.
+ // does not impact the decision on whether to optimize VL. But if there is
+ // another user of MI and it has VL=0, we need to be sure not to reduce the
+ // VL of MI to zero when the VLOp of UserOp is may be non-zero.
if (isVectorOpUsedAsScalarOp(UserOp)) {
- CanReduceVL = false;
- break;
+ [[maybe_unused]] Register R = UserOp.getReg();
+ [[maybe_unused]] const TargetRegisterClass *RC = MRI->getRegClass(R);
+ assert(RISCV::VRRegClass.hasSubClassEq(RC) &&
+ "Expect LMUL 1 register class for vector as scalar operands!");
+ LLVM_DEBUG(dbgs() << " Used this operand as a scalar operand\n");
+ const MCInstrDesc &Desc = UserMI.getDesc();
+ unsigned VLOpNum = RISCVII::getVLOpNum(Desc);
+ const MachineOperand &VLOp = UserMI.getOperand(VLOpNum);
+ if ((VLOp.isReg() && VLOp.getReg() != RISCV::X0) ||
+ (VLOp.isImm() && VLOp.getImm() != 0)) {
+ if (!CommonVL) {
+ CommonVL = &VLOp;
+ continue;
+ } else if (!CommonVL->isIdenticalTo(VLOp)) {
+ CanReduceVL = false;
+ break;
+ }
+ }
}
if (mayReadPastVL(UserMI)) {
diff --git a/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir b/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir
index 8587ec136afd83..e151d82e4d6217 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir
@@ -892,3 +892,23 @@ body: |
%x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
%y:vr = PseudoVMV_V_V_MF2 $noreg, %x, 1, 3 /* e8 */, 0
...
+---
+name: vred_vs2
+body: |
+ bb.0:
+ ; CHECK-LABEL: name: vred_vs2
+ ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %y:vr = PseudoVREDAND_VS_M1_E8 $noreg, %x, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
+ %y:vr = PseudoVREDAND_VS_M1_E8 $noreg, %x, $noreg, 1, 3 /* e8 */, 0
+...
+---
+name: vred_vs1
+body: |
+ bb.0:
+ ; CHECK-LABEL: name: vred_vs1
+ ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %y:vr = PseudoVREDAND_VS_M1_E8 $noreg, $noreg, %x, 1, 3 /* e8 */, 0 /* tu, mu */
+ %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
+ %y:vr = PseudoVREDAND_VS_M1_E8 $noreg, $noreg, %x, 1, 3 /* e8 */, 0
+...
diff --git a/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir b/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir
index 010e3ca642269b..64a6a09fb6e8dd 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir
@@ -33,3 +33,15 @@ body: |
%y:vr = PseudoVREDSUM_VS_M1_E64 $noreg, %x, $noreg, -1, 6 /* e64 */, 0 /* tu, mu */
%z:vr = PseudoVADD_VV_M1 $noreg, %x, $noreg, %vl, 5 /* e32 */, 0 /* tu, mu */
...
+---
+name: vred_other_user_is_vl0
+body: |
+ bb.0:
+ ; CHECK-LABEL: name: vred_other_user_is_vl0
+ ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %y:vr = PseudoVREDSUM_VS_M1_E8 $noreg, $noreg, %x, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %z:vr = PseudoVADD_VV_M1 $noreg, %x, $noreg, 0, 3 /* e8 */, 0 /* tu, mu */
+ %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
+ %y:vr = PseudoVREDSUM_VS_M1_E8 $noreg, $noreg, %x, 1, 3 /* e8 */, 0
+ %z:vr = PseudoVADD_VV_M1 $noreg, %x, $noreg, 0, 3 /* e8 */, 0
+...
|
1a0953d
to
da1b1c9
Compare
I think the net effect of this change is to simply treat the scalar value as consuming VL elements. This is a conservative approximation (i.e. correct). However, can't the same effect be achieved by having getOperandInfo return that result for the scalar operand, and then removing the special case entirely? Your code as written is missing the check that the scalar operand has the same SEW as the instruction defining that value. That's a functional bug. |
I made this less conservative and took your advice just to make the VL=1. |
859e30e
to
90b42ce
Compare
ping, for those who are not out on vacation :) |
@@ -12,9 +12,11 @@ define i16 @ctz_v4i32(<4 x i32> %a) { | |||
; RV32-NEXT: vsetvli zero, zero, e8, mf4, ta, ma | |||
; RV32-NEXT: vmv.v.i v8, 0 | |||
; RV32-NEXT: vmerge.vim v8, v8, -1, v0 | |||
; RV32-NEXT: vsetivli zero, 1, e8, mf4, ta, ma | |||
; RV32-NEXT: vid.v v9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be further optimized? since vl=1
so there is only one element 0
in v9
, thus vrsub.vi v9, v9, 4
equals to vmv.x.i v9, 4
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a nice find! I think this optimization should be done in a separate patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this specific to VL=1 though? I think this can hold for all VL when we know that all elements are 0 in the vector source?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is VL-specific. vid.v generates a sequence of index ([0, 1, 2, ...]).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this specific case with vid, yes it’s vl=1 specific. More generally this problem can be treated as if all lanes of the source up to VL are known to be zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is coming from the expansion of the cttz.elts intrinsic, this is probably (maybe?) a missing combine on STEP_VECTOR when VL=1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is (still) not correct.
I think you're getting yourself confused here, and need to take a step back and revisit. This is (deceptively) simple, and you seem to be tripping up on that. I'd suggest reorganizing this bit of code in terms of an accessor which gets the VL for a given use on an instruction, and then sharing all of the merge/validation logic between this codepath and the existing one. Stop trying to make the current early continue idiom work. It doesn't.
@@ -12,9 +12,11 @@ define i16 @ctz_v4i32(<4 x i32> %a) { | |||
; RV32-NEXT: vsetvli zero, zero, e8, mf4, ta, ma | |||
; RV32-NEXT: vmv.v.i v8, 0 | |||
; RV32-NEXT: vmerge.vim v8, v8, -1, v0 | |||
; RV32-NEXT: vsetivli zero, 1, e8, mf4, ta, ma | |||
; RV32-NEXT: vid.v v9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is coming from the expansion of the cttz.elts intrinsic, this is probably (maybe?) a missing combine on STEP_VECTOR when VL=1.
@preames, thanks for another look through.
I wasn't 100% sure what you meant by
Now we should be using the SEW correctness by checking for compatible OperandInfo objects. There is a test in I've also made it so we don't unconditionally set CommonVL = One, which was incorrect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM w/minor comment
// If the operand is used as a scalar operand, then the EEW must be | ||
// compatible. Otherwise, the EMUL *and* EEW must be compatible. | ||
if ((isVectorOpUsedAsScalarOp(UserOp) && | ||
!OperandInfo::EEWAreEqual(ConsumerInfo, ProducerInfo)) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a possible follow up - in general, I think a larger source LMUL is fine. You could unify the code here by returning the smallest legal LMUL for the given SEW for the scalar source operand, and then using a greater than comparison for EMUL here. Not sure if that improves readability or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way of phrasing this could be "is vlmax known greater than or equal to"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy to address this in a follow up patch.
Reductions are weird because for some operands, they are vector registers but only read the first lane. For these operands, we do not need to check to make sure the EEW and EMUL ratios match. However, we need to make sure that when the reduction instruction has a non-zero VL operand, we don't try and set the CommonVL=0. Since this is an edge case, we decide just not to optimize anything when this occurs.
59bfd57
to
fe55f64
Compare
if (VLOp.isReg() || (VLOp.isImm() && VLOp.getImm() != 0)) | ||
return MachineOperand::CreateImm(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know if a register is non-zero.
Which makes me wonder do we need to check the user's VL? I think we can return CreateImm(1) every time even if the user has a VL of 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know if a register is non-zero.
That is true, but if it is a register, then we need to conservatively treat it that it might, which is why we return 1. If we just returned 1 if it was a non-zero immediate, then I think that is a missed optimization opportunity.
I think what we should really be doing here is returning VLOp in the else
branch of this check. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually why can't we always just return VLOp for the scalar operand of a reduction?
If VLOp is non-zero then the source's VL won't get reduced to anything below one.
If VLOp is zero then the user is a no-op so it doesn't matter what the source's VL is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imagine you have the following:
%v = VADD_VV ...
%s = VREDSUM w/ %v as scalar source
%dead = VADD_VV %v, %v w/ VL=0
When scanning the users of %v, we will decide that the correct VL for %v is 0 (or a register which might be zero), and reduce it below the minimum VL=1 required by the reduction. To fix this, we need to treat the CommonVL for the scalar operand case as being VL=1.
If VLOp is zero then the user is a no-op so it doesn't matter what the source's VL is?
I don't think this is true in this case? We need to keep it at 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When scanning the users of %v, we will decide that the correct VL for %v is 0 (or a register which might be zero)
IIUC though the only time the VL would be reduced to a register that might be zero is if the VREDSUM also uses the same register for its VL, which would also have to be zero?
Since I thought the RISCV::isVLKnownLE(*CommonVL, *VLOp)
check guarantees that we will never reduce the VL below anything we return in getMinimumVLForUser
.
We may be able to reduce it further to 1.
Agreed, I think we can return either 1 or VLOp. I just think that we don't need to check the value of VLOp here for correctness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may be able to reduce it further to 1.
I edited my previous comment to say:
We need to keep it at 1.
I think the code as written makes it so we don't have to worry whether your following statement is true, and it comes at very little extra cost to do the check:
IIUC though the only time the VL would be reduced to a register that might be zero is if the VREDSUM also uses the same register for its VL, which would also have to be zero?
I do have MIR tests in this PR where I don't think this assumption holds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm missing something here. We have two claims in the comments.
we need to be sure not to reduce the VL of MI to zero when the VLOp of UserOp may be non-zero
This implies that we may reduce MI's VL to something smaller than a user's VL.
But meanwhile in checkUsers:
Use the largest VL among all the users. If we cannot determine this statically, then we cannot optimize the VL.
This implies that we never reduce MI's VL to something smaller than a user's VL.
My understanding is that checkUsers is correct here, so we will never reduce MI's VL to zero if a user has a non-zero VL.
And if we remove the isVectorOpUsedAsScalarOp check, the MIR test case is still correct:
diff --git a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
index 0b437e63fa3d..860e56fba839 100644
--- a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
+++ b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
@@ -1066,26 +1066,6 @@ RISCVVLOptimizer::getMinimumVLForUser(MachineOperand &UserOp) {
return std::nullopt;
}
- // Instructions like reductions may use a vector register as a scalar
- // register. In this case, we should treat it like a scalar register which
- // does not impact the decision on whether to optimize VL. But if there is
- // another user of MI and it may have VL=0, we need to be sure not to reduce
- // the VL of MI to zero when the VLOp of UserOp may be non-zero. The most
- // we can reduce it to is one.
- if (isVectorOpUsedAsScalarOp(UserOp)) {
- [[maybe_unused]] Register R = UserOp.getReg();
- [[maybe_unused]] const TargetRegisterClass *RC = MRI->getRegClass(R);
- assert(RISCV::VRRegClass.hasSubClassEq(RC) &&
- "Expect LMUL 1 register class for vector as scalar operands!");
- LLVM_DEBUG(dbgs() << " Used this operand as a scalar operand\n");
-
- unsigned VLOpNum = RISCVII::getVLOpNum(Desc);
- const MachineOperand &VLOp = UserMI.getOperand(VLOpNum);
- return VLOp.isReg() || (VLOp.isImm() && VLOp.getImm() != 0)
- ? MachineOperand::CreateImm(1)
- : VLOp;
- }
-
unsigned VLOpNum = RISCVII::getVLOpNum(Desc);
const MachineOperand &VLOp = UserMI.getOperand(VLOpNum);
// Looking for an immediate or a register VL that isn't X0.
diff --git a/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir b/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir
index 9486fe92c041..c3e09d6b1438 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir
@@ -1172,13 +1172,13 @@ name: vred_vl0_and_vlreg
body: |
bb.0:
; CHECK-LABEL: name: vred_vl0_and_vlreg
; CHECK: %vl:gprnox0 = COPY $x1
- ; CHECK-NEXT: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
; CHECK-NEXT: %y:vr = PseudoVREDSUM_VS_M1_E8 $noreg, $noreg, %x, %vl, 3 /* e8 */, 0 /* tu, mu */
; CHECK-NEXT: %z:vr = PseudoVADD_VV_M1 $noreg, %x, $noreg, 0, 3 /* e8 */, 0 /* tu, mu */
%vl:gprnox0 = COPY $x1
%x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
%y:vr = PseudoVREDSUM_VS_M1_E8 $noreg, $noreg, %x, %vl, 3 /* e8 */, 0
%z:vr = PseudoVADD_VV_M1 $noreg, %x, $noreg, 0, 3 /* e8 */, 0
...
---
checkUsers doesn't reduce the VL because VREDSUM's VL %x
is not statically known to be greater than or equal to VADD's VL 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're in agreement that we don't need the isVectorOpUsedAsScalarOp
logic in getMinimumVLForUser for correctness. But we do need it to avoid the missed optimization opportunity in vred_vl0_and_vlreg
.
Regarding some of your takeaways in the last comment:
we need to be sure not to reduce the VL of MI to zero when the VLOp of UserOp may be non-zero
This implies that we may reduce MI's VL to something smaller than a user's VL.
I agree. This occurs only in the case where we would return VL=1 in the case when the scalar operand is a register or non-zero immediate. This is because we only read the first lane of the vector register.
Use the largest VL among all the users. If we cannot determine this statically, then we cannot optimize the VL.
This implies that we never reduce MI's VL to something smaller than a user's VL.
I disagree with this implication. largest VL among all the users
really refers to the largest getMinimumVLForUser
, which may be smaller than a users VL.
So in reality, these two statements are not conflicting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I should have said "smaller than the user's minimum VL", i.e. it won't reduce it to something that's incorrect.
In terms of this check though, if I'm reading it right it's equivalent to this:
if (VLOp.isImm() && VLOp.getImm() == 0)
return std::nullopt;
return MachineOperand::CreateImm(1);
I'm not sure why we need to abort if VLOp is an immediate 0, given it should be correct to return a larger minimum VL of 1 too.
So I think we should drop the check and always return a MachineOperand::CreateImm(1)
. Or we could also return MachineOperand::CreateImm(0)
for that case as a further optimization, but I don't think VLOp = 0 is that common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Luke, I'm only skimming this subthread, but can I ask that you clearly separate and treat different any potential correctness concern from a missed optimization? I think the code as written was and is correct. Am I missing something? A missed optimization is best handled in a follow up review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Reductions are weird because for some operands, they are vector registers but only read the first lane. For these operands, we do not need to check to make sure the EEW and EMUL ratios match. However, we need to make sure that when the reduction instruction has a non-zero VL operand, we don't try and set the CommonVL=0.