[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable #130577

Merged Apr 1, 2025 (31 commits)

Commits
fc7a509
Narrow 64 bit math to 32 bit if profitable
Shoreshen Mar 10, 2025
0fe9dbc
add tests
Shoreshen Mar 10, 2025
9df0718
fix mul, remove sub
Shoreshen Mar 10, 2025
a5084d2
fix lit.cfg.py
Shoreshen Mar 10, 2025
2e2d190
fix test
Shoreshen Mar 10, 2025
2063614
fix variable name
Shoreshen Mar 11, 2025
af47303
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 11, 2025
0ac2f9e
fix comments
Shoreshen Mar 11, 2025
f7d0769
fix comments
Shoreshen Mar 11, 2025
f54c570
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 12, 2025
68ef90b
move from aggressive-instcombine to codegenprepare
Shoreshen Mar 12, 2025
ad9c30d
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 13, 2025
f4fb6d0
move to amdgpu-codegenprepare
Shoreshen Mar 13, 2025
c7fbcd1
fix comments
Shoreshen Mar 13, 2025
bc8d2a2
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 13, 2025
29b30c9
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 17, 2025
b03ea21
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 18, 2025
aef04fa
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 19, 2025
e40fbf2
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 20, 2025
f946445
fix comments
Shoreshen Mar 20, 2025
ab4b6ce
fix lit
Shoreshen Mar 20, 2025
4159ffb
fix format
Shoreshen Mar 20, 2025
9d4736c
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 21, 2025
4c53694
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 24, 2025
4501fcf
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 25, 2025
d44ee75
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 26, 2025
9bfea1d
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 27, 2025
f7357db
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 28, 2025
279009c
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 28, 2025
c55754c
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Mar 31, 2025
fa00e4d
Merge branch 'main' into narrow-math-for-and-operand
Shoreshen Apr 1, 2025
84 changes: 84 additions & 0 deletions llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -1561,6 +1561,87 @@ void AMDGPUCodeGenPrepareImpl::expandDivRem64(BinaryOperator &I) const {
  llvm_unreachable("not a division");
}

Type *findSmallestLegalBits(Instruction *I, int OrigBit, int MaxBitsNeeded,
                            const TargetLowering *TLI, const DataLayout &DL) {
  if (MaxBitsNeeded >= OrigBit)
    return nullptr;

  Type *NewType = I->getType()->getWithNewBitWidth(MaxBitsNeeded);
  while (OrigBit > MaxBitsNeeded) {
    if (TLI->isOperationLegalOrCustom(
            TLI->InstructionOpcodeToISD(I->getOpcode()),
            TLI->getValueType(DL, NewType, true)))
      return NewType;

    MaxBitsNeeded *= 2;
    NewType = I->getType()->getWithNewBitWidth(MaxBitsNeeded);
  }
  return nullptr;
}

static bool tryNarrowMathIfNoOverflow(Instruction *I, const TargetLowering *TLI,
                                      const TargetTransformInfo &TTI,
                                      const DataLayout &DL) {
  unsigned Opc = I->getOpcode();
  Type *OldType = I->getType();

  if (Opc != Instruction::Add && Opc != Instruction::Mul)
    return false;

  unsigned OrigBit = OldType->getScalarSizeInBits();
  unsigned MaxBitsNeeded = OrigBit;

  switch (Opc) {
  case Instruction::Add:
    MaxBitsNeeded = KnownBits::add(computeKnownBits(I->getOperand(0), DL),
                                   computeKnownBits(I->getOperand(1), DL))
                        .countMaxActiveBits();
    break;
  case Instruction::Mul:
    MaxBitsNeeded = KnownBits::mul(computeKnownBits(I->getOperand(0), DL),
                                   computeKnownBits(I->getOperand(1), DL))
                        .countMaxActiveBits();
    break;
  default:
    llvm_unreachable("Unexpected opcode, only valid for Instruction::Add and "
                     "Instruction::Mul.");
  }

  MaxBitsNeeded = std::max<unsigned>(bit_ceil(MaxBitsNeeded), 8);
  Type *NewType = findSmallestLegalBits(I, OrigBit, MaxBitsNeeded, TLI, DL);

  if (!NewType)
    return false;

  // Old cost
  InstructionCost OldCost =
      TTI.getArithmeticInstrCost(Opc, OldType, TTI::TCK_RecipThroughput);
  // New cost of new op
  InstructionCost NewCost =
      TTI.getArithmeticInstrCost(Opc, NewType, TTI::TCK_RecipThroughput);
  // New cost of narrowing 2 operands (use trunc)
  NewCost += 2 * TTI.getCastInstrCost(Instruction::Trunc, NewType, OldType,
                                      TTI.getCastContextHint(I),
                                      TTI::TCK_RecipThroughput);
  // New cost of zext narrowed result to original type
  NewCost +=
      TTI.getCastInstrCost(Instruction::ZExt, OldType, NewType,
                           TTI.getCastContextHint(I), TTI::TCK_RecipThroughput);
  if (NewCost >= OldCost)
    return false;
Comment on lines +1616 to +1631

Contributor:

I think this cost makes the transformation too conservative. Usually the truncs will be removed in the final code. Also, it does not include the benefit of using fewer registers with the narrower operations.

Contributor Author @Shoreshen, Mar 28, 2025:

Hi @LU-JOHN, maybe yes, but I don't have another cost strategy in mind at the moment. By the way, the cost of trunc is 0 in the AMD backend, and the cost of the narrowed arithmetic is proportional to the narrowed bit width. I'm not sure whether that already accounts for the register savings.
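For intuition on the comparison above, with illustrative numbers only (not the values AMDGPU's TTI actually reports): say a 64-bit mul has reciprocal-throughput cost 4 and a 32-bit mul cost 1. With the two truncs free and the zext costing 1, NewCost = 1 + 0 + 1 = 2 < 4 = OldCost, so the narrowing fires. If each cast were instead charged cost 1, NewCost would be 1 + 2 + 1 = 4 >= OldCost and the transform would be skipped even though the casts usually disappear in the final code, which is the conservatism noted above.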


  IRBuilder<> Builder(I);
  Value *Trunc0 = Builder.CreateTrunc(I->getOperand(0), NewType);
  Value *Trunc1 = Builder.CreateTrunc(I->getOperand(1), NewType);
  Value *Arith =
      Builder.CreateBinOp((Instruction::BinaryOps)Opc, Trunc0, Trunc1);

  Value *Zext = Builder.CreateZExt(Arith, OldType);
  I->replaceAllUsesWith(Zext);
  I->eraseFromParent();
  return true;
}

bool AMDGPUCodeGenPrepareImpl::visitBinaryOperator(BinaryOperator &I) {
  if (foldBinOpIntoSelect(I))
    return true;
@@ -1645,6 +1726,9 @@ bool AMDGPUCodeGenPrepareImpl::visitBinaryOperator(BinaryOperator &I) {
    }
  }

  Changed = tryNarrowMathIfNoOverflow(&I, ST.getTargetLowering(),
                                      TM.getTargetTransformInfo(F), DL);

  return Changed;
}
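For illustration, a minimal hand-written IR sketch of the rewrite, modelled on the umul24_i64_2 test update below (the function and value names here are illustrative, not pass output): when known-bits analysis shows an i64 mul's result fits in 32 bits and the 32-bit operation is cheaper, the operands are truncated, the mul is performed in i32, and the result is zero-extended back to i64.

; Before amdgpu-codegenprepare: both operands are masked to 16 bits,
; so the product needs at most 32 bits.
define i64 @narrow_mul_example(i64 %lhs, i64 %rhs) {
  %lhs16 = and i64 %lhs, 65535
  %rhs16 = and i64 %rhs, 65535
  %mul = mul i64 %lhs16, %rhs16
  ret i64 %mul
}

; After: the i64 mul is replaced by truncs, an i32 mul, and a zext.
define i64 @narrow_mul_example(i64 %lhs, i64 %rhs) {
  %lhs16 = and i64 %lhs, 65535
  %rhs16 = and i64 %rhs, 65535
  %t0 = trunc i64 %lhs16 to i32
  %t1 = trunc i64 %rhs16 to i32
  %m = mul i32 %t0, %t1
  %mul = zext i32 %m to i64
  ret i64 %mul
}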

5 changes: 4 additions & 1 deletion llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-mul24.ll
@@ -414,7 +414,10 @@ define i64 @umul24_i64_2(i64 %lhs, i64 %rhs) {
; DISABLED-LABEL: @umul24_i64_2(
; DISABLED-NEXT: [[LHS24:%.*]] = and i64 [[LHS:%.*]], 65535
; DISABLED-NEXT: [[RHS24:%.*]] = and i64 [[RHS:%.*]], 65535
; DISABLED-NEXT: [[MUL:%.*]] = mul i64 [[LHS24]], [[RHS24]]
; DISABLED-NEXT: [[TMP1:%.*]] = trunc i64 [[LHS24]] to i32
; DISABLED-NEXT: [[TMP2:%.*]] = trunc i64 [[RHS24]] to i32
; DISABLED-NEXT: [[TMP3:%.*]] = mul i32 [[TMP1]], [[TMP2]]
; DISABLED-NEXT: [[MUL:%.*]] = zext i32 [[TMP3]] to i64
; DISABLED-NEXT: ret i64 [[MUL]]
;
%lhs24 = and i64 %lhs, 65535
52 changes: 24 additions & 28 deletions llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -1823,22 +1823,22 @@ define amdgpu_kernel void @add_i64_constant(ptr addrspace(1) %out, ptr addrspace
; GFX1264: ; %bb.0: ; %entry
; GFX1264-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
; GFX1264-NEXT: s_mov_b64 s[6:7], exec
; GFX1264-NEXT: s_mov_b32 s9, 0
; GFX1264-NEXT: v_mbcnt_lo_u32_b32 v0, s6, 0
; GFX1264-NEXT: s_mov_b64 s[4:5], exec
; GFX1264-NEXT: v_mbcnt_lo_u32_b32 v0, s6, 0
; GFX1264-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1264-NEXT: v_mbcnt_hi_u32_b32 v2, s7, v0
; GFX1264-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1264-NEXT: v_cmpx_eq_u32_e32 0, v2
; GFX1264-NEXT: s_cbranch_execz .LBB3_2
; GFX1264-NEXT: ; %bb.1:
; GFX1264-NEXT: s_bcnt1_i32_b64 s8, s[6:7]
; GFX1264-NEXT: s_bcnt1_i32_b64 s6, s[6:7]
; GFX1264-NEXT: v_mov_b32_e32 v1, 0
; GFX1264-NEXT: s_wait_alu 0xfffe
; GFX1264-NEXT: s_mul_i32 s6, s6, 5
; GFX1264-NEXT: s_mov_b32 s11, 0x31016000
; GFX1264-NEXT: s_mul_u64 s[6:7], s[8:9], 5
; GFX1264-NEXT: s_mov_b32 s10, -1
; GFX1264-NEXT: s_wait_alu 0xfffe
; GFX1264-NEXT: v_mov_b32_e32 v0, s6
; GFX1264-NEXT: v_mov_b32_e32 v1, s7
; GFX1264-NEXT: s_mov_b32 s10, -1
; GFX1264-NEXT: s_wait_kmcnt 0x0
; GFX1264-NEXT: s_mov_b32 s8, s2
; GFX1264-NEXT: s_mov_b32 s9, s3
@@ -1860,29 +1860,27 @@ define amdgpu_kernel void @add_i64_constant(ptr addrspace(1) %out, ptr addrspace
; GFX1232-LABEL: add_i64_constant:
; GFX1232: ; %bb.0: ; %entry
; GFX1232-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
; GFX1232-NEXT: s_mov_b32 s7, exec_lo
; GFX1232-NEXT: s_mov_b32 s5, 0
; GFX1232-NEXT: v_mbcnt_lo_u32_b32 v2, s7, 0
; GFX1232-NEXT: s_mov_b32 s6, exec_lo
; GFX1232-NEXT: s_mov_b32 s4, exec_lo
; GFX1232-NEXT: v_mbcnt_lo_u32_b32 v2, s6, 0
; GFX1232-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1232-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1232-NEXT: v_cmpx_eq_u32_e32 0, v2
; GFX1232-NEXT: s_cbranch_execz .LBB3_2
; GFX1232-NEXT: ; %bb.1:
; GFX1232-NEXT: s_bcnt1_i32_b32 s4, s7
; GFX1232-NEXT: s_bcnt1_i32_b32 s5, s6
; GFX1232-NEXT: s_mov_b32 s11, 0x31016000
; GFX1232-NEXT: s_mul_u64 s[4:5], s[4:5], 5
; GFX1232-NEXT: s_mul_i32 s5, s5, 5
; GFX1232-NEXT: s_mov_b32 s10, -1
; GFX1232-NEXT: v_dual_mov_b32 v0, s4 :: v_dual_mov_b32 v1, s5
; GFX1232-NEXT: v_dual_mov_b32 v0, s5 :: v_dual_mov_b32 v1, 0
; GFX1232-NEXT: s_wait_kmcnt 0x0
; GFX1232-NEXT: s_mov_b32 s8, s2
; GFX1232-NEXT: s_mov_b32 s9, s3
; GFX1232-NEXT: buffer_atomic_add_u64 v[0:1], off, s[8:11], null th:TH_ATOMIC_RETURN scope:SCOPE_DEV
; GFX1232-NEXT: s_wait_loadcnt 0x0
; GFX1232-NEXT: global_inv scope:SCOPE_DEV
; GFX1232-NEXT: .LBB3_2:
; GFX1232-NEXT: s_wait_alu 0xfffe
; GFX1232-NEXT: s_or_b32 exec_lo, exec_lo, s6
; GFX1232-NEXT: s_or_b32 exec_lo, exec_lo, s4
; GFX1232-NEXT: s_wait_kmcnt 0x0
; GFX1232-NEXT: v_readfirstlane_b32 s3, v1
; GFX1232-NEXT: v_readfirstlane_b32 s2, v0
@@ -5372,22 +5370,22 @@ define amdgpu_kernel void @sub_i64_constant(ptr addrspace(1) %out, ptr addrspace
; GFX1264: ; %bb.0: ; %entry
; GFX1264-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
; GFX1264-NEXT: s_mov_b64 s[6:7], exec
; GFX1264-NEXT: s_mov_b32 s9, 0
; GFX1264-NEXT: v_mbcnt_lo_u32_b32 v0, s6, 0
; GFX1264-NEXT: s_mov_b64 s[4:5], exec
; GFX1264-NEXT: v_mbcnt_lo_u32_b32 v0, s6, 0
; GFX1264-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1264-NEXT: v_mbcnt_hi_u32_b32 v2, s7, v0
; GFX1264-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1264-NEXT: v_cmpx_eq_u32_e32 0, v2
; GFX1264-NEXT: s_cbranch_execz .LBB9_2
; GFX1264-NEXT: ; %bb.1:
; GFX1264-NEXT: s_bcnt1_i32_b64 s8, s[6:7]
; GFX1264-NEXT: s_bcnt1_i32_b64 s6, s[6:7]
; GFX1264-NEXT: v_mov_b32_e32 v1, 0
; GFX1264-NEXT: s_wait_alu 0xfffe
; GFX1264-NEXT: s_mul_i32 s6, s6, 5
; GFX1264-NEXT: s_mov_b32 s11, 0x31016000
; GFX1264-NEXT: s_mul_u64 s[6:7], s[8:9], 5
; GFX1264-NEXT: s_mov_b32 s10, -1
; GFX1264-NEXT: s_wait_alu 0xfffe
; GFX1264-NEXT: v_mov_b32_e32 v0, s6
; GFX1264-NEXT: v_mov_b32_e32 v1, s7
; GFX1264-NEXT: s_mov_b32 s10, -1
; GFX1264-NEXT: s_wait_kmcnt 0x0
; GFX1264-NEXT: s_mov_b32 s8, s2
; GFX1264-NEXT: s_mov_b32 s9, s3
@@ -5412,29 +5410,27 @@ define amdgpu_kernel void @sub_i64_constant(ptr addrspace(1) %out, ptr addrspace
; GFX1232-LABEL: sub_i64_constant:
; GFX1232: ; %bb.0: ; %entry
; GFX1232-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
; GFX1232-NEXT: s_mov_b32 s7, exec_lo
; GFX1232-NEXT: s_mov_b32 s5, 0
; GFX1232-NEXT: v_mbcnt_lo_u32_b32 v2, s7, 0
; GFX1232-NEXT: s_mov_b32 s6, exec_lo
; GFX1232-NEXT: s_mov_b32 s4, exec_lo
; GFX1232-NEXT: v_mbcnt_lo_u32_b32 v2, s6, 0
; GFX1232-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1232-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1232-NEXT: v_cmpx_eq_u32_e32 0, v2
; GFX1232-NEXT: s_cbranch_execz .LBB9_2
; GFX1232-NEXT: ; %bb.1:
; GFX1232-NEXT: s_bcnt1_i32_b32 s4, s7
; GFX1232-NEXT: s_bcnt1_i32_b32 s5, s6
; GFX1232-NEXT: s_mov_b32 s11, 0x31016000
; GFX1232-NEXT: s_mul_u64 s[4:5], s[4:5], 5
; GFX1232-NEXT: s_mul_i32 s5, s5, 5
; GFX1232-NEXT: s_mov_b32 s10, -1
; GFX1232-NEXT: v_dual_mov_b32 v0, s4 :: v_dual_mov_b32 v1, s5
; GFX1232-NEXT: v_dual_mov_b32 v0, s5 :: v_dual_mov_b32 v1, 0
; GFX1232-NEXT: s_wait_kmcnt 0x0
; GFX1232-NEXT: s_mov_b32 s8, s2
; GFX1232-NEXT: s_mov_b32 s9, s3
; GFX1232-NEXT: buffer_atomic_sub_u64 v[0:1], off, s[8:11], null th:TH_ATOMIC_RETURN scope:SCOPE_DEV
; GFX1232-NEXT: s_wait_loadcnt 0x0
; GFX1232-NEXT: global_inv scope:SCOPE_DEV
; GFX1232-NEXT: .LBB9_2:
; GFX1232-NEXT: s_wait_alu 0xfffe
; GFX1232-NEXT: s_or_b32 exec_lo, exec_lo, s6
; GFX1232-NEXT: s_or_b32 exec_lo, exec_lo, s4
; GFX1232-NEXT: s_wait_kmcnt 0x0
; GFX1232-NEXT: v_readfirstlane_b32 s2, v0
; GFX1232-NEXT: v_mul_u32_u24_e32 v0, 5, v2