[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable #130577

Shoreshen · 2025-03-10T10:50:52Z

For 64-bit (Int64) Add, Sub, and Mul, if profitable:

1. Truncate the operands to Int32
2. Apply the 32-bit Add/Sub/Mul
3. Zext the result back to Int64
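
The three steps can be sketched in standalone C++ as follows. This is only an illustration, not code from the PR: `narrowAdd` is a hypothetical helper name, and plain integer casts stand in for the IR trunc and zext instructions.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the PR): models the proposed rewrite
// for an add using plain integer casts.
constexpr uint64_t narrowAdd(uint64_t a, uint64_t b) {
  uint32_t a32 = static_cast<uint32_t>(a); // step 1: trunc i64 -> i32
  uint32_t b32 = static_cast<uint32_t>(b);
  uint32_t sum32 = a32 + b32;              // step 2: 32-bit add (wraps mod 2^32)
  return static_cast<uint64_t>(sum32);     // step 3: zext i32 -> i64
}

// Matches the 64-bit add when the true result fits in 32 unsigned bits.
static_assert(narrowAdd(3, 4) == 3 + 4, "small operands narrow exactly");
// Differs once the 64-bit result needs more than 32 bits.
static_assert(narrowAdd(0xFFFFFFFFull, 1) != 0xFFFFFFFFull + 1, "carry is lost");
```

The second assertion is why the transform needs a no-overflow precondition in addition to the profitability check.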

llvmbot · 2025-03-10T10:51:29Z

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-transforms

Author: None (Shoreshen)

Full diff: https://github.com/llvm/llvm-project/pull/130577.diff

1 Files Affected:

(modified) llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp (+44)

diff --git a/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp b/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
index 6b0f568864fd5..73bd75f37cc71 100644
--- a/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
+++ b/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
@@ -1224,6 +1224,49 @@ static bool foldLibCalls(Instruction &I, TargetTransformInfo &TTI,
   return false;
 }
 
+static bool tryNarrowMathIfNoOverflow(Instruction &I,
+                                      TargetTransformInfo &TTI) {
+  unsigned opc = I.getOpcode();
+  if (opc != Instruction::Add && opc != Instruction::Sub &&
+      opc != Instruction::Mul) {
+    return false;
+  }
+  LLVMContext &ctx = I.getContext();
+  Type *i64type = Type::getInt64Ty(ctx);
+  Type *i32type = Type::getInt32Ty(ctx);
+
+  if (I.getType() != i64type || !TTI.isTruncateFree(i64type, i32type)) {
+    return false;
+  }
+  InstructionCost costOp64 =
+      TTI.getArithmeticInstrCost(opc, i64type, TTI::TCK_RecipThroughput);
+  InstructionCost costOp32 =
+      TTI.getArithmeticInstrCost(opc, i32type, TTI::TCK_RecipThroughput);
+  InstructionCost costZext64 = TTI.getCastInstrCost(
+      Instruction::ZExt, i64type, i32type, TTI.getCastContextHint(&I),
+      TTI::TCK_RecipThroughput);
+  if ((costOp64 - costOp32) <= costZext64) {
+    return false;
+  }
+  uint64_t AndConst0, AndConst1;
+  Value *X;
+  if ((match(I.getOperand(0), m_And(m_Value(X), m_ConstantInt(AndConst0))) ||
+       match(I.getOperand(0), m_And(m_ConstantInt(AndConst0), m_Value(X)))) &&
+      AndConst0 <= 2147483647 &&
+      (match(I.getOperand(1), m_And(m_Value(X), m_ConstantInt(AndConst1))) ||
+       match(I.getOperand(1), m_And(m_ConstantInt(AndConst1), m_Value(X)))) &&
+      AndConst1 <= 2147483647) {
+    IRBuilder<> Builder(&I);
+    Value *trun0 = Builder.CreateTrunc(I.getOperand(0), i32type);
+    Value *trun1 = Builder.CreateTrunc(I.getOperand(1), i32type);
+    Value *arith32 = Builder.CreateAdd(trun0, trun1);
+    Value *zext64 = Builder.CreateZExt(arith32, i64type);
+    I.replaceAllUsesWith(zext64);
+    I.eraseFromParent();
+  }
+  return false;
+}
+
 /// This is the entry point for folds that could be implemented in regular
 /// InstCombine, but they are separated because they are not expected to
 /// occur frequently and/or have more than a constant-length pattern match.
@@ -1256,6 +1299,7 @@ static bool foldUnusualPatterns(Function &F, DominatorTree &DT,
       // needs to be called at the end of this sequence, otherwise we may make
       // bugs.
       MadeChange |= foldLibCalls(I, TTI, TLI, AC, DT, DL, MadeCFGChange);
+      MadeChange |= tryNarrowMathIfNoOverflow(I, TTI);
     }
   }
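
A rough standalone sketch of what the mask bound in the diff (both `and` constants at most 2147483647) does and does not guarantee for each opcode. The names `Op`, `narrowed`, and `wide` are hypothetical and only model the arithmetic; nothing here is LLVM API.

```cpp
#include <cstdint>

// The bound used in the diff: both masked operands are <= 0x7FFFFFFF.
constexpr uint64_t kMaskLimit = 2147483647;

enum class Op { Add, Sub, Mul };

// Models zext(op32(trunc(a), trunc(b))).
constexpr uint64_t narrowed(Op op, uint64_t a, uint64_t b) {
  uint32_t x = static_cast<uint32_t>(a);
  uint32_t y = static_cast<uint32_t>(b);
  uint32_t r = op == Op::Add ? x + y : op == Op::Sub ? x - y : x * y;
  return r;
}

// Models the original 64-bit operation.
constexpr uint64_t wide(Op op, uint64_t a, uint64_t b) {
  return op == Op::Add ? a + b : op == Op::Sub ? a - b : a * b;
}

// Add: the largest possible sum is 0xFFFFFFFE, which still fits in 32 bits.
static_assert(narrowed(Op::Add, kMaskLimit, kMaskLimit) ==
              wide(Op::Add, kMaskLimit, kMaskLimit), "add is exact");
// Sub: 0 - 1 is all ones in 64 bits, but only 0xFFFFFFFF after zext.
static_assert(narrowed(Op::Sub, 0, 1) != wide(Op::Sub, 0, 1), "sub can differ");
// Mul: 0x10000 * 0x10000 needs 33 bits even though both operands are in range.
static_assert(narrowed(Op::Mul, 0x10000, 0x10000) !=
              wide(Op::Mul, 0x10000, 0x10000), "mul can differ");
```

Under these assumptions the operand bound alone is enough for add, while sub and mul would need a check on the result range as well.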

tgymnich · 2025-03-10T11:23:22Z

Could you please add some test cases?

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp

shiltian

What about sext operations when the sign bit is 1?

arsenm · 2025-03-11T02:29:01Z

llvm/test/Transforms/AggressiveInstCombine/narrow_math_for_and.ll

+  %zext1 = and i64 %b, 2
+  %mul = mul i64 %zext0, %zext1
+  ret i64 %mul
+}


Test vector cases


Shoreshen · 2025-03-11T02:31:05Z

Hi @shiltian, if I'm not mistaken, sext can cause a problem. Take the following example:

define i64 @narrow_add(i64 noundef %a, i64 noundef %b) {
  %zext0 = and i64 %a, 1073741824 ; 0x0000000040000000
  %zext1 = and i64 %b, 1073741824 ; 0x0000000040000000
  %add = add i64 %zext0, %zext1
  ret i64 %add
}

So %zext0 and %zext1 are each either 0x0000000040000000 or 0 in 64 bits.

When %zext0 = %zext1 = 0x40000000, %zext0 + %zext1 = 0x0000000080000000.

If I truncate both %zext0 and %zext1 to 32 bits, each is 0x40000000, and the 32-bit add gives 0x80000000.

Bit 31 is 1, so sext would extend this to 0xFFFFFFFF80000000, which is not equal to %zext0 + %zext1.
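
The arithmetic in this example can be checked with a small standalone C++ sketch. The helper names `zextAdd32` and `sextAdd32` are hypothetical; they model a 32-bit add followed by zero or sign extension.

```cpp
#include <cstdint>

// 32-bit add of the truncated operands, zero-extended back to 64 bits.
constexpr uint64_t zextAdd32(uint64_t a, uint64_t b) {
  uint32_t sum = static_cast<uint32_t>(a) + static_cast<uint32_t>(b);
  return sum;
}

// Same 32-bit add, but sign-extended: bit 31 is copied into bits 32..63.
constexpr uint64_t sextAdd32(uint64_t a, uint64_t b) {
  uint32_t sum = static_cast<uint32_t>(a) + static_cast<uint32_t>(b);
  return (sum & 0x80000000u) ? (0xFFFFFFFF00000000ull | sum) : sum;
}

constexpr uint64_t kOperand = 0x40000000ull; // value of %zext0 and %zext1

// zext reproduces the 64-bit sum 0x0000000080000000.
static_assert(zextAdd32(kOperand, kOperand) == kOperand + kOperand, "zext ok");
// sext yields 0xFFFFFFFF80000000 instead.
static_assert(sextAdd32(kOperand, kOperand) == 0xFFFFFFFF80000000ull, "sext differs");
```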

dtcxzyw · 2025-03-11T02:39:33Z

You should check whether both LHS and RHS have more than 33 sign bits.

Shoreshen · 2025-03-11T02:54:33Z

Hi @dtcxzyw, I think the issue is that the constant being checked is the second operand of the and, so even if I have the following:

define i64 @narrow_add(i64 noundef %a, i64 noundef %b) {
  %zext0 = and i64 %a, 0xFFFFFFFFA0000000 ; all upper 33 bits are 1
  %zext1 = and i64 %b, 0xFFFFFFFFA0000000
  %add = add i64 %zext0, %zext1
  ret i64 %add
}

It could still be the case that %zext0 = %zext1 = 0x40000000, which causes the same problem with sext.

dtcxzyw · 2025-03-11T03:03:39Z

https://alive2.llvm.org/ce/z/Bom92i Having 34 sign bits should work.
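
A minimal sketch of why 34 sign bits are enough for a sign-extended 32-bit add while 33 are not. `numSignBits` and `sextNarrowingIsExact` are hypothetical standalone helpers written for this illustration; inside LLVM the corresponding analysis would be ComputeNumSignBits from ValueTracking.

```cpp
#include <cstdint>

// Number of leading bits that are copies of the sign bit, counting the
// sign bit itself (so the result is between 1 and 64).
constexpr unsigned numSignBits(int64_t v) {
  uint64_t u = static_cast<uint64_t>(v);
  uint64_t sign = u >> 63;
  unsigned n = 1;
  for (int bit = 62; bit >= 0 && ((u >> bit) & 1) == sign; --bit)
    ++n;
  return n;
}

// True when sext(add32(trunc(a), trunc(b))) equals the 64-bit a + b.
constexpr bool sextNarrowingIsExact(int64_t a, int64_t b) {
  uint64_t ua = static_cast<uint64_t>(a), ub = static_cast<uint64_t>(b);
  uint32_t sum = static_cast<uint32_t>(ua) + static_cast<uint32_t>(ub);
  uint64_t sext = (sum & 0x80000000u) ? (0xFFFFFFFF00000000ull | sum) : sum;
  return sext == ua + ub;
}

// 0x40000000 has only 33 sign bits, and the narrowed add is wrong.
static_assert(numSignBits(0x40000000) == 33, "33 sign bits");
static_assert(!sextNarrowingIsExact(0x40000000, 0x40000000), "overflows i32");
// 0x3FFFFFFF and -2^30 have 34 sign bits, and the narrowed add is exact.
static_assert(numSignBits(0x3FFFFFFF) == 34, "34 sign bits");
static_assert(sextNarrowingIsExact(0x3FFFFFFF, 0x3FFFFFFF), "fits in i32");
static_assert(numSignBits(-(int64_t{1} << 30)) == 34, "34 sign bits");
static_assert(sextNarrowingIsExact(-(int64_t{1} << 30), -(int64_t{1} << 30)),
              "fits in i32");
```

With at least 34 sign bits each operand lies in [-2^30, 2^30 - 1], so the sum lies in [-2^31, 2^31 - 2] and cannot overflow a signed 32-bit add.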

Shoreshen · 2025-03-11T03:22:34Z

Hi @dtcxzyw, I'm a little confused about the link you posted. Are you saying that function @src is equivalent to function @tgt?

dtcxzyw · 2025-03-11T03:28:30Z

No. I mean @src can be optimized into @tgt.

Shoreshen · 2025-03-11T03:39:53Z

Hi @dtcxzyw, with trunc and sext, yes, but the test cases I added contain no trunc or sext.

dtcxzyw · 2025-03-11T03:46:58Z

We don't need trunc. These trunc instructions in @src are used to make sure that both %x and %y have at least 10 sign bits.

shiltian · 2025-03-11T03:48:55Z

I'd add test cases to make sure invalid cases are not combined.

Shoreshen · 2025-03-11T03:56:38Z

Hi @dtcxzyw, trunc + sext can change the original value. If %x is b'1000000, then sext(trunc(%x)) = b'11111...1000

What I'm trying to say is that, if I'm not mistaken, @src is not equivalent to the following case:

define i16 @src(i16 %x, i16 %y) {
  %ax = and i16 %x, 127 ; b'1111111
  %ay = and i16 %y, 127 ; b'1111111
  %add = add i16 %ax, %ay
  ret i16 %add
}

Shoreshen · 2025-03-11T03:59:33Z

Hi @shiltian, some no_narrow cases have already been added. I'll address the comments and add a few more cases. Thanks~

Shoreshen · 2025-03-21T01:26:25Z

Hi @arsenm @shiltian @RKSimon @dtcxzyw @nikic, just checking whether there is anything else I need to fix in this PR. Thanks~

RKSimon

SGTM, but an AMDGPU guru should approve.

Shoreshen · 2025-03-27T01:04:56Z

Hi @arsenm @shiltian, just checking whether there is anything else I need to fix in this PR. Thanks~

LU-JOHN · 2025-03-27T19:30:18Z

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

+  // Old cost
+  InstructionCost OldCost =
+      TTI.getArithmeticInstrCost(Opc, OldType, TTI::TCK_RecipThroughput);
+  // New cost of new op
+  InstructionCost NewCost =
+      TTI.getArithmeticInstrCost(Opc, NewType, TTI::TCK_RecipThroughput);
+  // New cost of narrowing 2 operands (use trunc)
+  NewCost += 2 * TTI.getCastInstrCost(Instruction::Trunc, NewType, OldType,
+                                      TTI.getCastContextHint(I),
+                                      TTI::TCK_RecipThroughput);
+  // New cost of zext narrowed result to original type
+  NewCost +=
+      TTI.getCastInstrCost(Instruction::ZExt, OldType, NewType,
+                           TTI.getCastContextHint(I), TTI::TCK_RecipThroughput);
+  if (NewCost >= OldCost)
+    return false;


I think this cost makes the transformation too conservative. Usually the truncs will be removed in the final code. Also, it does not include the benefit of using fewer registers with the narrower operations.

Hi @LU-JOHN, maybe yes, but currently I don't have another cost strategy in mind. BTW, the AMD backend reports the cost of trunc as 0, and the cost of the narrowed arithmetic is proportional to the number of bits narrowed. I'm not sure whether this accounts for the register savings.
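
The profitability test in the quoted hunk can be modelled in standalone C++ as below. The `Costs` struct, the `profitable` function, and all cost numbers are hypothetical values chosen for illustration; they are not taken from the AMDGPU cost model.

```cpp
#include <cstdint>

// Hypothetical cost table for one opcode (arbitrary units).
struct Costs {
  int op64;  // cost of the original 64-bit operation
  int op32;  // cost of the narrowed 32-bit operation
  int trunc; // cost of one trunc i64 -> i32
  int zext;  // cost of the final zext i32 -> i64
};

// Mirrors the comparison in the hunk: narrow only when the new sequence
// (op32 + two truncs + one zext) is strictly cheaper than the old op.
constexpr bool profitable(const Costs &c) {
  int oldCost = c.op64;
  int newCost = c.op32 + 2 * c.trunc + c.zext;
  return newCost < oldCost;
}

// Assumed numbers: trunc is free, op cost scales with bit width.
constexpr Costs kCheapOp{2, 1, 0, 1};  // e.g. an add-like operation
constexpr Costs kCostlyOp{4, 1, 0, 1}; // e.g. a mul-like operation

static_assert(!profitable(kCheapOp), "zext cancels the saving");
static_assert(profitable(kCostlyOp), "saving exceeds the zext cost");
```

With these assumed numbers a cheap operation is never narrowed because the zext cost cancels the saving, which is one way the check could turn out conservative, as raised in the review comment.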

shiltian

LGTM though I'm not an AMDGPU guru.

llvm-ci · 2025-04-01T03:51:29Z

LLVM Buildbot has detected a new failure on builder lldb-remote-linux-ubuntu running on as-builder-9 while building llvm at step 16 "test-check-lldb-api".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/195/builds/6973