[AMDGPU] Enable "amdgpu-uniform-intrinsic-combine" pass in pipeline. #128687

PankajDwivedi-25 · 2025-02-25T10:08:07Z

This PR enables "amdgpu-uniform-intrinsic-combine" pass in the pipeline.
Also introduces "amdgpu-enable-uniform-intrinsic-combine" cmd line flag to enbale/disable the pass.

This PR will be rebased and merged once PR #116953 is merged.

llvmbot · 2025-02-25T10:08:29Z

@llvm/pr-subscribers-backend-amdgpu

Author: Pankaj Dwivedi (PankajDwivedi-25)

Changes

This PR enables "amdgpu-uniform-intrinsic-combine" pass in the pipeline.
Also introduces "amdgpu-enable-uniform-intrinsic-combine" cmd line flag to enbale/disable the pass.

This PR will be merged once PR #116953 is merged.

Patch is 163.84 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/128687.diff

21 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+11)
(modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+1)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+12-1)
(added) llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp (+164)
(modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
(modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll (+19-23)
(modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll (+21-25)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-miscellaneous-uniform-intrinsics.ll (+131)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-simplify-uniform-waterfall.ll (+170)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-uniform-intrinsic-combine.ll (+453)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-uniform-temporal-divergence.ll (+57)
(modified) llvm/test/CodeGen/AMDGPU/fsqrt.f64.ll (+536-4)
(modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+23)
(modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll (+20-19)
(modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i64.ll (+20-19)
(modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ll (+34-23)
(modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ptr.ll (+23-57)
(modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll (+14-17)
(modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readlane.ll (+16-19)
(modified) llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr-update-regscavenger.ll (+9-14)
(modified) llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll (+107-114)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 428355a739628..7e628e126c983 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -466,6 +466,17 @@ extern char &GCNRewritePartialRegUsesID;
 void initializeAMDGPUWaitSGPRHazardsLegacyPass(PassRegistry &);
 extern char &AMDGPUWaitSGPRHazardsLegacyID;
 
+void initializeAMDGPUUniformIntrinsicCombineLegacyPass(PassRegistry &);
+extern char &AMDGPUUniformIntrinsicCombineLegacyPassID;
+FunctionPass *createAMDGPUUniformIntrinsicCombineLegacyPass();
+
+struct AMDGPUUniformIntrinsicCombinePass
+    : public PassInfoMixin<AMDGPUUniformIntrinsicCombinePass> {
+  const AMDGPUTargetMachine &TM;
+  AMDGPUUniformIntrinsicCombinePass(const AMDGPUTargetMachine &TM) : TM(TM) {}
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+
 namespace AMDGPU {
 enum TargetIndex {
   TI_CONSTDATA_START,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index e89d84c8a105f..02a121a2b18b5 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -67,6 +67,7 @@ FUNCTION_PASS("amdgpu-unify-divergent-exit-nodes",
               AMDGPUUnifyDivergentExitNodesPass())
 FUNCTION_PASS("amdgpu-usenative", AMDGPUUseNativeCallsPass())
 FUNCTION_PASS("si-annotate-control-flow", SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS("amdgpu-uniform-intrinsic-combine", AMDGPUUniformIntrinsicCombinePass(*this))
 #undef FUNCTION_PASS
 
 #ifndef FUNCTION_ANALYSIS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 828c1702ae07a..046a72c74383a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -477,6 +477,11 @@ static cl::opt<bool> HasClosedWorldAssumption(
     cl::desc("Whether has closed-world assumption at link time"),
     cl::init(false), cl::Hidden);
 
+static cl::opt<bool> EnableUniformIntrinsicCombine(
+    "amdgpu-enable-uniform-intrinsic-combine",
+    cl::desc("Enable/Disable the Uniform Intrinsic Combine Pass"),
+    cl::init(true), cl::Hidden);
+
 extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   // Register the target
   RegisterTargetMachine<R600TargetMachine> X(getTheR600Target());
@@ -560,6 +565,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeGCNRegPressurePrinterPass(*PR);
   initializeAMDGPUPreloadKernArgPrologLegacyPass(*PR);
   initializeAMDGPUWaitSGPRHazardsLegacyPass(*PR);
+  initializeAMDGPUUniformIntrinsicCombineLegacyPass(*PR);
 }
 
 static std::unique_ptr<TargetLoweringObjectFile> createTLOF(const Triple &TT) {
@@ -829,13 +835,16 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
       });
 
   PB.registerPeepholeEPCallback(
-      [](FunctionPassManager &FPM, OptimizationLevel Level) {
+      [this](FunctionPassManager &FPM, OptimizationLevel Level) {
         if (Level == OptimizationLevel::O0)
           return;
 
         FPM.addPass(AMDGPUUseNativeCallsPass());
         if (EnableLibCallSimplify)
           FPM.addPass(AMDGPUSimplifyLibCallsPass());
+
+        if (EnableUniformIntrinsicCombine)
+          FPM.addPass(AMDGPUUniformIntrinsicCombinePass(*this));
       });
 
   PB.registerCGSCCOptimizerLateEPCallback(
@@ -1213,6 +1222,8 @@ void AMDGPUPassConfig::addIRPasses() {
   if (isPassEnabled(EnableImageIntrinsicOptimizer))
     addPass(createAMDGPUImageIntrinsicOptimizerPass(&TM));
 
+  if (EnableUniformIntrinsicCombine)
+    addPass(createAMDGPUUniformIntrinsicCombineLegacyPass());
   // This can be disabled by passing ::Disable here or on the command line
   // with --expand-variadics-override=disable.
   addPass(createExpandVariadicsPass(ExpandVariadicsMode::Lowering));
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp
new file mode 100644
index 0000000000000..6074eaeac3bf0
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp
@@ -0,0 +1,164 @@
+//===-- AMDGPUUniformIntrinsicCombine.cpp ---------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file
+/// This pass simplifies certain intrinsic calls when the arguments are uniform.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
+#include "llvm/Analysis/LoopInfo.h"
+#include "llvm/Analysis/ScalarEvolution.h"
+#include "llvm/Analysis/TargetLibraryInfo.h"
+#include "llvm/Analysis/UniformityAnalysis.h"
+#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/InstIterator.h"
+#include "llvm/IR/InstVisitor.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Target/TargetMachine.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+
+#define DEBUG_TYPE "amdgpu-uniform-intrinsic-combine"
+
+using namespace llvm;
+using namespace llvm::AMDGPU;
+using namespace llvm::PatternMatch;
+
+namespace {
+class AMDGPUUniformIntrinsicCombineLegacy : public FunctionPass {
+public:
+  static char ID;
+  AMDGPUUniformIntrinsicCombineLegacy() : FunctionPass(ID) {
+    initializeAMDGPUUniformIntrinsicCombineLegacyPass(
+        *PassRegistry::getPassRegistry());
+  }
+  bool runOnFunction(Function &F) override;
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.setPreservesCFG();
+    AU.addRequired<UniformityInfoWrapperPass>();
+    AU.addRequired<TargetPassConfig>();
+  }
+};
+
+class AMDGPUUniformIntrinsicCombineImpl
+    : public InstVisitor<AMDGPUUniformIntrinsicCombineImpl> {
+private:
+  const UniformityInfo *UI;
+  bool optimizeUniformIntrinsicInst(IntrinsicInst &II) const;
+
+public:
+  AMDGPUUniformIntrinsicCombineImpl() = delete;
+  AMDGPUUniformIntrinsicCombineImpl(const UniformityInfo *UI) : UI(UI) {}
+  bool run(Function &F);
+};
+} // namespace
+
+char AMDGPUUniformIntrinsicCombineLegacy::ID = 0;
+char &llvm::AMDGPUUniformIntrinsicCombineLegacyPassID =
+    AMDGPUUniformIntrinsicCombineLegacy::ID;
+
+bool AMDGPUUniformIntrinsicCombineLegacy::runOnFunction(Function &F) {
+  if (skipFunction(F)) {
+    return false;
+  }
+  const UniformityInfo *UI =
+      &getAnalysis<UniformityInfoWrapperPass>().getUniformityInfo();
+  return AMDGPUUniformIntrinsicCombineImpl(UI).run(F);
+}
+
+PreservedAnalyses
+AMDGPUUniformIntrinsicCombinePass::run(Function &F,
+                                       FunctionAnalysisManager &AM) {
+  const auto *UI = &AM.getResult<UniformityInfoAnalysis>(F);
+  bool IsChanged = AMDGPUUniformIntrinsicCombineImpl(UI).run(F);
+
+  if (!IsChanged) {
+    return PreservedAnalyses::all();
+  }
+  PreservedAnalyses PA;
+  PA.preserve<DominatorTreeAnalysis>();
+  PA.preserve<LoopAnalysis>();
+  PA.preserve<ScalarEvolutionAnalysis>();
+  PA.preserve<UniformityInfoAnalysis>();
+  PA.preserve<TargetLibraryAnalysis>();
+  return PA;
+}
+
+bool AMDGPUUniformIntrinsicCombineImpl::run(Function &F) {
+  bool IsChanged{false};
+
+  // Iterate over each instruction in the function to get the desired intrinsic
+  // inst to check for optimization.
+  for (Instruction &I : make_early_inc_range(instructions(F))) {
+    if (auto *Intrinsic = dyn_cast<IntrinsicInst>(&I)) {
+      IsChanged |= optimizeUniformIntrinsicInst(*Intrinsic);
+    }
+  }
+  return IsChanged;
+}
+
+bool AMDGPUUniformIntrinsicCombineImpl::optimizeUniformIntrinsicInst(
+    IntrinsicInst &II) const {
+  llvm::Intrinsic::ID IID = II.getIntrinsicID();
+
+  switch (IID) {
+  case Intrinsic::amdgcn_permlane64:
+  case Intrinsic::amdgcn_readfirstlane:
+  case Intrinsic::amdgcn_readlane: {
+    Value *Src = II.getArgOperand(0);
+    // Check if the argument use is divergent
+    if (UI->isDivergentUse(II.getOperandUse(0)))
+      return false;
+    LLVM_DEBUG(dbgs() << "Replacing " << II << " with " << *Src << "\n");
+    II.replaceAllUsesWith(Src);
+    return true;
+  }
+  case Intrinsic::amdgcn_ballot: {
+    Value *Src = II.getArgOperand(0);
+    if (UI->isDivergentUse(II.getOperandUse(0)))
+      return false;
+    LLVM_DEBUG(dbgs() << "Found uniform ballot intrinsic: " << II << "\n");
+
+    bool Changed = false;
+    for (User *U : make_early_inc_range(II.users())) {
+      if (auto *ICmp = dyn_cast<ICmpInst>(U)) {
+        Value *Op0 = ICmp->getOperand(0);
+        Value *Op1 = ICmp->getOperand(1);
+
+        if (ICmp->getPredicate() == ICmpInst::ICMP_EQ &&
+            ((Op0 == &II && match(Op1, m_Zero())) ||
+             (Op1 == &II && match(Op0, m_Zero())))) {
+
+          IRBuilder<> Builder(ICmp);
+          Value *Xor = Builder.CreateXor(Src, Builder.getTrue());
+          LLVM_DEBUG(dbgs() << "Replacing with XOR: " << *Xor << "\n");
+          ICmp->replaceAllUsesWith(Xor);
+          Changed = true;
+        }
+      }
+    }
+    return Changed;
+  }
+  }
+  return false;
+}
+
+INITIALIZE_PASS_BEGIN(AMDGPUUniformIntrinsicCombineLegacy, DEBUG_TYPE,
+                      "AMDGPU uniformIntrinsic Combine", false, false)
+INITIALIZE_PASS_DEPENDENCY(UniformityInfoWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(TargetPassConfig)
+INITIALIZE_PASS_END(AMDGPUUniformIntrinsicCombineLegacy, DEBUG_TYPE,
+                    "AMDGPU uniformIntrinsic Combine", false, false)
+
+FunctionPass *llvm::createAMDGPUUniformIntrinsicCombineLegacyPass() {
+  return new AMDGPUUniformIntrinsicCombineLegacy();
+}
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 408da0536237e..620037dca1c2b 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -61,6 +61,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUHSAMetadataStreamer.cpp
   AMDGPUInsertDelayAlu.cpp
   AMDGPUInstCombineIntrinsic.cpp
+  AMDGPUUniformIntrinsicCombine.cpp
   AMDGPUInstrInfo.cpp
   AMDGPUInstructionSelector.cpp
   AMDGPUISelDAGToDAG.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
index 927a31d3992b0..612842758f38a 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
@@ -162,16 +162,17 @@ false:
 define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_non_compare(i32 inreg %v) {
 ; CHECK-LABEL: branch_uniform_ballot_eq_zero_non_compare:
 ; CHECK:       ; %bb.0:
-; CHECK-NEXT:    s_and_b32 s0, 1, s0
-; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_xor_b32 s0, s0, 1
+; CHECK-NEXT:    s_xor_b32 s0, s0, 1
+; CHECK-NEXT:    s_and_b32 s0, s0, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
-; CHECK-NEXT:    s_cbranch_scc0 .LBB10_2
-; CHECK-NEXT:  ; %bb.1: ; %false
-; CHECK-NEXT:    s_mov_b32 s0, 33
-; CHECK-NEXT:    s_branch .LBB10_3
-; CHECK-NEXT:  .LBB10_2: ; %true
+; CHECK-NEXT:    s_cbranch_scc1 .LBB10_2
+; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
 ; CHECK-NEXT:    s_branch .LBB10_3
+; CHECK-NEXT:  .LBB10_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB10_3
 ; CHECK-NEXT:  .LBB10_3:
   %c = trunc i32 %v to i1
   %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
@@ -259,17 +260,13 @@ define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_compare(i32 inreg %v) {
 ; CHECK-LABEL: branch_uniform_ballot_eq_zero_compare:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_cmp_lt_u32 s0, 12
-; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
-; CHECK-NEXT:    s_and_b32 s0, 1, s0
-; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
-; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
-; CHECK-NEXT:    s_cbranch_scc0 .LBB14_2
-; CHECK-NEXT:  ; %bb.1: ; %false
-; CHECK-NEXT:    s_mov_b32 s0, 33
-; CHECK-NEXT:    s_branch .LBB14_3
-; CHECK-NEXT:  .LBB14_2: ; %true
+; CHECK-NEXT:    s_cbranch_scc1 .LBB14_2
+; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
 ; CHECK-NEXT:    s_branch .LBB14_3
+; CHECK-NEXT:  .LBB14_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB14_3
 ; CHECK-NEXT:  .LBB14_3:
   %c = icmp ult i32 %v, 12
   %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
@@ -374,16 +371,15 @@ define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_and(i32 inreg %v1, i32 inreg
 ; CHECK-NEXT:    s_cmp_gt_u32 s1, 34
 ; CHECK-NEXT:    s_cselect_b32 s1, 1, 0
 ; CHECK-NEXT:    s_and_b32 s0, s0, s1
-; CHECK-NEXT:    s_and_b32 s0, 1, s0
-; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_and_b32 s0, s0, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
-; CHECK-NEXT:    s_cbranch_scc0 .LBB18_2
-; CHECK-NEXT:  ; %bb.1: ; %false
-; CHECK-NEXT:    s_mov_b32 s0, 33
-; CHECK-NEXT:    s_branch .LBB18_3
-; CHECK-NEXT:  .LBB18_2: ; %true
+; CHECK-NEXT:    s_cbranch_scc1 .LBB18_2
+; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
 ; CHECK-NEXT:    s_branch .LBB18_3
+; CHECK-NEXT:  .LBB18_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB18_3
 ; CHECK-NEXT:  .LBB18_3:
   %v1c = icmp ult i32 %v1, 12
   %v2c = icmp ugt i32 %v2, 34
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
index 0bbb40b8db43a..358306a43e08c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
@@ -165,16 +165,17 @@ false:
 define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_non_compare(i32 inreg %v) {
 ; CHECK-LABEL: branch_uniform_ballot_eq_zero_non_compare:
 ; CHECK:       ; %bb.0:
-; CHECK-NEXT:    s_and_b32 s0, 1, s0
-; CHECK-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, s0
-; CHECK-NEXT:    s_cmp_lg_u64 s[0:1], 0
-; CHECK-NEXT:    s_cbranch_scc0 .LBB10_2
-; CHECK-NEXT:  ; %bb.1: ; %false
-; CHECK-NEXT:    s_mov_b32 s0, 33
-; CHECK-NEXT:    s_branch .LBB10_3
-; CHECK-NEXT:  .LBB10_2: ; %true
+; CHECK-NEXT:    s_xor_b32 s0, s0, 1
+; CHECK-NEXT:    s_xor_b32 s0, s0, 1
+; CHECK-NEXT:    s_and_b32 s0, s0, 1
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB10_2
+; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
 ; CHECK-NEXT:    s_branch .LBB10_3
+; CHECK-NEXT:  .LBB10_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB10_3
 ; CHECK-NEXT:  .LBB10_3:
   %c = trunc i32 %v to i1
   %ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %c)
@@ -262,17 +263,13 @@ define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_compare(i32 inreg %v) {
 ; CHECK-LABEL: branch_uniform_ballot_eq_zero_compare:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_cmp_lt_u32 s0, 12
-; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
-; CHECK-NEXT:    s_and_b32 s0, 1, s0
-; CHECK-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, s0
-; CHECK-NEXT:    s_cmp_lg_u64 s[0:1], 0
-; CHECK-NEXT:    s_cbranch_scc0 .LBB14_2
-; CHECK-NEXT:  ; %bb.1: ; %false
-; CHECK-NEXT:    s_mov_b32 s0, 33
-; CHECK-NEXT:    s_branch .LBB14_3
-; CHECK-NEXT:  .LBB14_2: ; %true
+; CHECK-NEXT:    s_cbranch_scc1 .LBB14_2
+; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
 ; CHECK-NEXT:    s_branch .LBB14_3
+; CHECK-NEXT:  .LBB14_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB14_3
 ; CHECK-NEXT:  .LBB14_3:
   %c = icmp ult i32 %v, 12
   %ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %c)
@@ -377,16 +374,15 @@ define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_and(i32 inreg %v1, i32 inreg
 ; CHECK-NEXT:    s_cmp_gt_u32 s1, 34
 ; CHECK-NEXT:    s_cselect_b32 s1, 1, 0
 ; CHECK-NEXT:    s_and_b32 s0, s0, s1
-; CHECK-NEXT:    s_and_b32 s0, 1, s0
-; CHECK-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, s0
-; CHECK-NEXT:    s_cmp_lg_u64 s[0:1], 0
-; CHECK-NEXT:    s_cbranch_scc0 .LBB18_2
-; CHECK-NEXT:  ; %bb.1: ; %false
-; CHECK-NEXT:    s_mov_b32 s0, 33
-; CHECK-NEXT:    s_branch .LBB18_3
-; CHECK-NEXT:  .LBB18_2: ; %true
+; CHECK-NEXT:    s_and_b32 s0, s0, 1
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB18_2
+; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
 ; CHECK-NEXT:    s_branch .LBB18_3
+; CHECK-NEXT:  .LBB18_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB18_3
 ; CHECK-NEXT:  .LBB18_3:
   %v1c = icmp ult i32 %v1, 12
   %v2c = icmp ugt i32 %v2, 34
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-miscellaneous-uniform-intrinsics.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-miscellaneous-uniform-intrinsics.ll
new file mode 100644
index 0000000000000..f450b0e6763c4
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-miscellaneous-uniform-intrinsics.ll
@@ -0,0 +1,131 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -verify-machineinstrs -o - %s | FileCheck %s
+
+define amdgpu_kernel void @readfirstlane_with_readfirstlane(ptr addrspace(1) %out) {
+; CHECK-LABEL: readfirstlane_with_readfirstlane:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_b64 s[0:1], s[4:5], 0x0
+; CHECK-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, 5
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    global_store_b32 v0, v1, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %v1 = call i32 @llvm.amdgcn.readfirstlane(i32 5)
+  %v2 = call i32 @llvm.amdgcn.readfirstlane(i32 %v1)
+  store i32 %v2, ptr addrspace(1) %out
+  ret void
+}
+
+define amdgpu_kernel void @readfirstlane_with_readlane(ptr addrspace(1) %out) {
+; CHECK-LABEL: readfirstlane_with_readlane:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_b64 s[0:1], s[4:5], 0x0
+; CHECK-NEXT:    v_bfe_u32 v1, v0, 10, 10
+; CHECK-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; CHECK-NEXT:    v_readfirstlane_b32 s2, v1
+; CHECK-NEXT:    v_readlane_b32 s2, v0, s2
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    global_store_b32 v0, v1, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %tidx = call i32 @llvm.amdgcn.workitem.id.x()
+  %tidy = call i32 @llvm.amdgcn.workitem.id.y()
+  %v1 = call i32 @llvm.amdgcn.readlane(i32 %tidx, i32 %tidy)
+  %v2 = call i32 @llvm.amdgcn.readfirstlane(i32 %v1)
+  store i32 %v2, ptr addrspace(1) %out
+  ret void
+}
+
+define amdgpu_kernel void @readlane_with_firstlane(ptr addrspace(1) %out) {
+; CHECK-LABEL: readlane_with_firstlane:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_b64 s[0:1], s[4:5], 0x0
+; CHECK-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; CHECK-NEXT:    v_readfirstlane_b32 s2, v0
+; CHECK-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    global_store_b32 v0, v1, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %tidx = call i32 @llvm.amdgcn.workitem.id.x()
+  %v1 = call i32 @llvm.amdgcn.readfirstlane(i32 %tidx)
+  %v2 = call i32 @llvm.amdgcn.readlane(i32 %v1, i32 3)
+  store i32 %v2, ptr addrspace(1) %out
+  ret void
+}
+
+define amdgpu_kernel void @readlane_readlane(ptr addrspace(1) %out) {
+; CHECK-LABEL: readlane_readlane:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_b64 s[0:1], s[4:5], 0x0
+; CHECK-NEXT:    v_bfe_u32 v1, v0, 10, 10
+; CHECK-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; CHECK-NEXT:    v_readfirstlane_b32 s2, v1
+; CHECK-NEXT:    v_readlane_b32 s2, v0, s2
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    global_store_b32 v0, v1, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %tidx = call i32 @llvm.amdgcn.workitem.id.x()
+  %tidy = call i32 @llvm.amdgcn.workitem.id.y()
+  %v1 = call i32 @llvm.amdgcn.readlane(i32 %tidx, i32 %tidy)
+  %v2 = call i32 @llvm.amdgcn.readlane(i32 %v1, i32 2)
+  store i32 %v2, ptr addrspace(1) %out
+  ret void
+}
+
+define amdgpu_kernel void @pe...
[truncated]

github-actions · 2025-02-25T10:11:43Z

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:

git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 58035b5ef5bbdf121d7083c64ac5b4eb04abbeaa a6aef830521d7c510c6a0fe9e3b672fd6cfc24b4 llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp llvm/test/CodeGen/AMDGPU/amdgpu-miscellaneous-uniform-intrinsics.ll llvm/test/CodeGen/AMDGPU/amdgpu-simplify-uniform-waterfall.ll llvm/test/CodeGen/AMDGPU/amdgpu-uniform-intrinsic-combine.ll llvm/test/CodeGen/AMDGPU/amdgpu-uniform-temporal-divergence.ll llvm/lib/Target/AMDGPU/AMDGPU.h llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll llvm/test/CodeGen/AMDGPU/bf16.ll llvm/test/CodeGen/AMDGPU/fmaximum3.ll llvm/test/CodeGen/AMDGPU/fminimum3.ll llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll llvm/test/CodeGen/AMDGPU/fsqrt.f32.ll llvm/test/CodeGen/AMDGPU/fsqrt.f64.ll llvm/test/CodeGen/AMDGPU/llc-pipeline.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i64.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ptr.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readlane.ll llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr-update-regscavenger.ll llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll

The following files introduce new uses of undef:

llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

cdevadas · 2025-02-25T15:31:57Z

This PR should be stacked with #116953. It is hard to review the changes introduced by this PR. It currently combines both.

cdevadas · 2025-02-25T15:39:35Z

llvm/test/CodeGen/AMDGPU/bf16.ll

@@ -1,3 +1,4 @@
+; XFAIL: *  


Can you state with an example how ISel currently introduces an illegal COPY?
It mostly needs a fix to introduce readfirstlane in the first place, instead of illegal VGPR to SGPR copy.

PankajDwivedi-25 and others added 17 commits February 24, 2025 13:03

[WIP][AMDGPU] combine uniform AMDGPU lane Intrinsics

c2f5258

refactored and updated intrinsics handling

91dd6d9

removed redundant casting

c98a967

refactored, added more test

ef78ece

integrated in pipeline, more test added

d3ab828

removed unused gfx checks

f5edd78

added pass to llc pipeline, more test added

242afb1

handled ballot with icmp for trivial waterfall

69dd0aa

updated test

be7a5a2

Fix: use isDivergentUse instead isUniform

16bc3a8

addressed reviews

b3c4e93

pull the ballot argument to all the match users

94a4787

Match and replace icmp ballot,0 with XOR

7d0af28

Rebase: resolve merge

809ffb5

remove undef test

11f6c0b

added pass to llc pipeline

13023a4

Update test checks

b4e8b93

PankajDwivedi-25 added the backend:AMDGPU label Feb 25, 2025

PankajDwivedi-25 requested review from arsenm and ssahasra February 25, 2025 10:08

PankajDwivedi-25 closed this Feb 25, 2025

PankajDwivedi-25 reopened this Feb 25, 2025

PankajDwivedi-25 requested review from jayfoad and cdevadas February 25, 2025 13:57

expectedly failing test, illegal COPY introduction by isel

a6aef83

cdevadas reviewed Feb 25, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMDGPU] Enable "amdgpu-uniform-intrinsic-combine" pass in pipeline. #128687

[AMDGPU] Enable "amdgpu-uniform-intrinsic-combine" pass in pipeline. #128687

PankajDwivedi-25 commented Feb 25, 2025 •

edited

Loading

llvmbot commented Feb 25, 2025

github-actions bot commented Feb 25, 2025 •

edited

Loading

cdevadas commented Feb 25, 2025

cdevadas Feb 25, 2025

[AMDGPU] Enable "amdgpu-uniform-intrinsic-combine" pass in pipeline. #128687

Are you sure you want to change the base?

[AMDGPU] Enable "amdgpu-uniform-intrinsic-combine" pass in pipeline. #128687

Conversation

PankajDwivedi-25 commented Feb 25, 2025 • edited Loading

llvmbot commented Feb 25, 2025

github-actions bot commented Feb 25, 2025 • edited Loading

cdevadas commented Feb 25, 2025

cdevadas Feb 25, 2025

Choose a reason for hiding this comment

PankajDwivedi-25 commented Feb 25, 2025 •

edited

Loading

github-actions bot commented Feb 25, 2025 •

edited

Loading