[LV, VP] VP intrinsics support for the Loop Vectorizer
This patch introduces the generation of VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication only in a very limited way, via tail folding and masked load/store/gather/scatter intrinsics. This does not let architectures with active-vector-length predication take advantage of their capabilities, and architectures with general masked predication can only apply predication to memory operations. By giving the Loop Vectorizer a way to generate Vector Predication (VP) intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for load/store instructions.
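
For illustration, here is a minimal sketch (assumed names, not code from this patch) of the difference between the masked lowering used today and a VP-based lowering for a widened load:

```cpp
// Minimal sketch (assumed names, not the patch's code): lowering a widened,
// tail-folded load either with the existing masked intrinsic or with a VP
// intrinsic that additionally carries an explicit vector length.
#include "llvm/IR/Attributes.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"

using namespace llvm;

Value *emitTailFoldedLoad(IRBuilderBase &Builder, VectorType *DataTy,
                          Value *Addr, Align Alignment, Value *Mask,
                          Value *EVL /* nullptr => no EVL available */) {
  if (!EVL)
    // Current tail folding: predication is expressed only through the mask.
    return Builder.CreateMaskedLoad(DataTy, Addr, Alignment, Mask);

  // VP-based tail folding: llvm.vp.load takes both a mask and an EVL, so
  // targets with active-vector-length support can consume it directly.
  CallInst *Load =
      Builder.CreateIntrinsic(DataTy, Intrinsic::vp_load, {Addr, Mask, EVL},
                              /*FMFSource=*/nullptr, "vp.op.load");
  Load->addParamAttr(
      0, Attribute::getWithAlignment(Load->getContext(), Alignment));
  return Load;
}
```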

Another important part of this approach is how the Explicit Vector Length is computed. (We use "active vector length" and "explicit vector length" interchangeably; VP intrinsics call this vector-length parameter the Explicit Vector Length (EVL).) We consider the following three ways to compute the EVL parameter for the VP intrinsics.

- The simplest way is to use the VF as the EVL and rely solely on the mask parameter to control predication. The mask parameter is the same one computed by the current tail-folding implementation.
- The second way is to insert instructions that compute `min(VF, trip_count - index)` for each vector iteration.
- For architectures like RISC-V, which have a special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic, `get_vector_length`, which can be lowered to architecture-specific instruction(s) that compute the EVL. (A rough sketch of the last two options follows this list.)
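
A minimal sketch of how the last two options might be emitted with IRBuilder, assuming the vectorizer provides the trip count, the canonical induction variable, and the chosen VF (all names here are illustrative, not taken from the patch):

```cpp
// Minimal sketch (assumed names, not the patch's code) of two ways the EVL
// could be computed for one vector iteration. TripCount and Index are the
// scalar trip count and the canonical induction variable; VF is the chosen
// vectorization factor.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/Support/TypeSize.h"

using namespace llvm;

// Option 2: EVL = min(VF, trip_count - index), computed with plain IR.
Value *emitEVLAsMin(IRBuilderBase &Builder, Value *TripCount, Value *Index,
                    ElementCount VF) {
  Value *Remaining = Builder.CreateSub(TripCount, Index);
  // For scalable VFs this expands to VF * vscale.
  Value *RuntimeVF = Builder.CreateElementCount(TripCount->getType(), VF);
  return Builder.CreateBinaryIntrinsic(Intrinsic::umin, Remaining, RuntimeVF);
}

// Option 3: let the target compute it (e.g. RISC-V vsetvli) through the
// experimental get.vector.length intrinsic; the immediates encode the
// statically chosen VF and whether it is scalable.
Value *emitEVLViaIntrinsic(IRBuilderBase &Builder, Value *TripCount,
                           Value *Index, ElementCount VF) {
  Value *Remaining = Builder.CreateSub(TripCount, Index);
  return Builder.CreateIntrinsic(
      Builder.getInt32Ty(), Intrinsic::experimental_get_vector_length,
      {Remaining, Builder.getInt32(VF.getKnownMinValue()),
       Builder.getInt1(VF.isScalable())});
}
```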

We also add a new recipe to emit the instructions that compute the EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

===Tentative Development Roadmap===

* Use VP intrinsics for all possible vector operations. This work has two possible implementations:
   1. Introduce a new pass which transforms emitted vector instructions into VP intrinsics if the loop was transformed to use predication for loads/stores. The advantage of this approach is that it does not require many changes in the loop vectorizer itself. The disadvantage is that it may require copying some existing functionality from the loop vectorizer into a separate pass, keeping similar code in different passes, and performing the same analysis at least twice.
   2. Extend the Loop Vectorizer using VectorBuilder and make it emit VP intrinsics automatically in the presence of an EVL value (a rough sketch of this option follows the roadmap list). The advantage is that it does not require a separate pass, which may reduce compile time, and code duplication is avoided. It requires some extra work in the LoopVectorizer to add VectorBuilder support and to emit vector instructions/VP intrinsics intelligently. Fully supporting the Loop Vectorizer will also require adding a new PHI recipe to handle the EVL from the previous iteration, plus extending several existing recipes with new operands (depending on the design).
* Switch to VP intrinsics for memory operations for both VLS and VLA vectorization.
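
A minimal sketch of what the VectorBuilder-based option could look like for a single widened add, assuming an EVL value and mask are already available (names are illustrative, not part of the patch):

```cpp
// Minimal sketch (assumed names, not the patch's code) of how the
// VectorBuilder-based option could emit a widened binary operation as a VP
// intrinsic once an EVL value and a mask are available.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/VectorBuilder.h"

using namespace llvm;

Value *emitWidenedAdd(IRBuilderBase &Builder, Value *A, Value *B, Value *Mask,
                      Value *EVL) {
  if (!EVL)
    // Today: a plain vector add; predication, if any, only affects the
    // surrounding memory operations.
    return Builder.CreateAdd(A, B);

  // With an EVL: emit llvm.vp.add carrying both the mask and the EVL.
  VectorBuilder VBuilder(Builder);
  VBuilder.setEVL(EVL).setMask(Mask);
  return VBuilder.createVectorInstruction(Instruction::Add, A->getType(),
                                          {A, B}, "vp.add");
}
```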

Differential Revision: https://reviews.llvm.org/D99750
alexey-bataev authored and ChunyuLiao committed Jan 3, 2024
1 parent 17afa5b commit 2c203bb
Showing 24 changed files with 1,632 additions and 32 deletions.
5 changes: 4 additions & 1 deletion llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -190,7 +190,10 @@ enum class TailFoldingStyle {
/// Use predicate to control both data and control flow, but modify
/// the trip count so that a runtime overflow check can be avoided
/// and such that the scalar epilogue loop can always be removed.
DataAndControlFlowWithoutRuntimeCheck
DataAndControlFlowWithoutRuntimeCheck,
/// Use predicated EVL instructions for tail-folding.
/// Indicates that VP intrinsics should be used if tail-folding is enabled.
DataWithEVL,
};

struct TailFoldingInfo {
4 changes: 4 additions & 0 deletions llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -169,6 +169,10 @@ RISCVTTIImpl::getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,
return TTI::TCC_Free;
}

bool RISCVTTIImpl::hasActiveVectorLength(unsigned, Type *DataTy, Align) const {
return ST->hasVInstructions();
}

TargetTransformInfo::PopcntSupportKind
RISCVTTIImpl::getPopcntSupport(unsigned TyWidth) {
assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");
16 changes: 16 additions & 0 deletions llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -72,6 +72,22 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
const APInt &Imm, Type *Ty,
TTI::TargetCostKind CostKind);

/// \name Vector Predication Information
/// Whether the target supports the %evl parameter of VP intrinsic efficiently
/// in hardware, for the given opcode and type/alignment. (see LLVM Language
/// Reference - "Vector Predication Intrinsics",
/// https://llvm.org/docs/LangRef.html#vector-predication-intrinsics and
/// "IR-level VP intrinsics",
/// https://llvm.org/docs/Proposals/VectorPredication.html#ir-level-vp-intrinsics).
/// \param Opcode the opcode of the instruction checked for predicated version
/// support.
/// \param DataType the type of the instruction with the \p Opcode checked for
/// prediction support.
/// \param Alignment the alignment for memory access operation checked for
/// predicated version support.
bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
Align Alignment) const;

TargetTransformInfo::PopcntSupportKind getPopcntSupport(unsigned TyWidth);

bool shouldExpandReduction(const IntrinsicInst *II) const;
160 changes: 151 additions & 9 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -123,6 +123,7 @@
#include "llvm/IR/User.h"
#include "llvm/IR/Value.h"
#include "llvm/IR/ValueHandle.h"
#include "llvm/IR/VectorBuilder.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/CommandLine.h"
@@ -247,10 +248,12 @@ static cl::opt<TailFoldingStyle> ForceTailFoldingStyle(
clEnumValN(TailFoldingStyle::DataAndControlFlow, "data-and-control",
"Create lane mask using active.lane.mask intrinsic, and use "
"it for both data and control flow"),
clEnumValN(
TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
"data-and-control-without-rt-check",
"Similar to data-and-control, but remove the runtime check")));
clEnumValN(TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
"data-and-control-without-rt-check",
"Similar to data-and-control, but remove the runtime check"),
clEnumValN(TailFoldingStyle::DataWithEVL, "data-with-evl",
"Use predicated EVL instructions for tail folding if the "
"target supports vector length predication")));

static cl::opt<bool> MaximizeBandwidth(
"vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
@@ -1105,9 +1108,7 @@ void InnerLoopVectorizer::collectPoisonGeneratingRecipes(
// handled.
if (isa<VPWidenMemoryInstructionRecipe>(CurRec) ||
isa<VPInterleaveRecipe>(CurRec) ||
isa<VPScalarIVStepsRecipe>(CurRec) ||
isa<VPCanonicalIVPHIRecipe>(CurRec) ||
isa<VPActiveLaneMaskPHIRecipe>(CurRec))
isa<VPScalarIVStepsRecipe>(CurRec) || isa<VPHeaderPHIRecipe>(CurRec))
continue;

// This recipe contributes to the address computation of a widen
@@ -1655,6 +1656,23 @@ class LoopVectorizationCostModel {
return foldTailByMasking() || Legal->blockNeedsPredication(BB);
}

/// Returns true if VP intrinsics with explicit vector length support should
/// be generated in the tail folded loop.
bool useVPIWithVPEVLVectorization() const {
return PreferEVL && !EnableVPlanNativePath &&
getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
// FIXME: implement support for max safe dependency distance.
Legal->isSafeForAnyVectorWidth() &&
// FIXME: remove this once reductions are supported.
Legal->getReductionVars().empty() &&
// FIXME: remove this once vp_reverse is supported.
none_of(
WideningDecisions,
[](const std::pair<std::pair<Instruction *, ElementCount>,
std::pair<InstWidening, InstructionCost>>
&Data) { return Data.second.first == CM_Widen_Reverse; });
}

/// Returns true if the Phi is part of an inloop reduction.
bool isInLoopReduction(PHINode *Phi) const {
return InLoopReductions.contains(Phi);
@@ -1800,6 +1818,10 @@ class LoopVectorizationCostModel {
/// All blocks of loop are to be masked to fold tail of scalar iterations.
bool CanFoldTailByMasking = false;

/// Control whether to generate VP intrinsics with explicit-vector-length
/// support in vectorized code.
bool PreferEVL = false;

/// A map holding scalar costs for different vectorization factors. The
/// presence of a cost for an instruction in the mapping indicates that the
/// instruction will be scalarized when vectorizing with the associated
@@ -4883,6 +4905,39 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
// FIXME: look for a smaller MaxVF that does divide TC rather than masking.
if (Legal->prepareToFoldTailByMasking()) {
CanFoldTailByMasking = true;
if (getTailFoldingStyle() == TailFoldingStyle::None)
return MaxFactors;

if (UserIC > 1) {
LLVM_DEBUG(dbgs() << "LV: Preference for VP intrinsics indicated. Will "
"not generate VP intrinsics since interleave count "
"specified is greater than 1.\n");
return MaxFactors;
}

if (MaxFactors.ScalableVF.isVector()) {
assert(MaxFactors.ScalableVF.isScalable() &&
"Expected scalable vector factor.");
// FIXME: use actual opcode/data type for analysis here.
PreferEVL = getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
TTI.hasActiveVectorLength(0, nullptr, Align());
#if !NDEBUG
if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
if (PreferEVL)
dbgs() << "LV: Preference for VP intrinsics indicated. Will "
"try to generate VP Intrinsics.\n";
else
dbgs() << "LV: Preference for VP intrinsics indicated. Will "
"not try to generate VP Intrinsics since the target "
"does not support vector length predication.\n";
}
#endif // !NDEBUG

// Tail folded loop using VP intrinsics restricts the VF to be scalable.
if (PreferEVL)
MaxFactors.FixedVF = ElementCount::getFixed(1);
}

return MaxFactors;
}

@@ -5493,6 +5548,10 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
if (!isScalarEpilogueAllowed())
return 1;

// Do not interleave if EVL is preferred and no User IC is specified.
if (useVPIWithVPEVLVectorization())
return 1;

// We used the distance for the interleave count.
if (!Legal->isSafeForAnyVectorWidth())
return 1;
@@ -8622,6 +8681,8 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
VPlanTransforms::truncateToMinimalBitwidths(
*Plan, CM.getMinimalBitwidths(), PSE.getSE()->getContext());
VPlanTransforms::optimize(*Plan, *PSE.getSE());
if (CM.useVPIWithVPEVLVectorization())
VPlanTransforms::addExplicitVectorLength(*Plan);
assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");
VPlans.push_back(std::move(Plan));
}
@@ -9454,6 +9515,52 @@ void VPReplicateRecipe::execute(VPTransformState &State) {
State.ILV->scalarizeInstruction(UI, this, VPIteration(Part, Lane), State);
}

/// Creates either vp_store or vp_scatter intrinsics calls to represent
/// predicated store/scatter.
static Instruction *
lowerStoreUsingVectorIntrinsics(IRBuilderBase &Builder, Value *Addr,
Value *StoredVal, bool IsScatter, Value *Mask,
Value *EVLPart, const Align &Alignment) {
CallInst *Call;
if (IsScatter) {
Call = Builder.CreateIntrinsic(Type::getVoidTy(EVLPart->getContext()),
Intrinsic::vp_scatter,
{StoredVal, Addr, Mask, EVLPart});
} else {
VectorBuilder VBuilder(Builder);
VBuilder.setEVL(EVLPart).setMask(Mask);
Call = cast<CallInst>(VBuilder.createVectorInstruction(
Instruction::Store, Type::getVoidTy(EVLPart->getContext()),
{StoredVal, Addr}));
}
Call->addParamAttr(
1, Attribute::getWithAlignment(Call->getContext(), Alignment));
return Call;
}

/// Creates either vp_load or vp_gather intrinsics calls to represent
/// predicated load/gather.
static Instruction *lowerLoadUsingVectorIntrinsics(IRBuilderBase &Builder,
VectorType *DataTy,
Value *Addr, bool IsGather,
Value *Mask, Value *EVLPart,
const Align &Alignment) {
CallInst *Call;
if (IsGather) {
Call = Builder.CreateIntrinsic(DataTy, Intrinsic::vp_gather,
{Addr, Mask, EVLPart}, nullptr,
"wide.masked.gather");
} else {
VectorBuilder VBuilder(Builder);
VBuilder.setEVL(EVLPart).setMask(Mask);
Call = cast<CallInst>(VBuilder.createVectorInstruction(
Instruction::Load, DataTy, Addr, "vp.op.load"));
}
Call->addParamAttr(
0, Attribute::getWithAlignment(Call->getContext(), Alignment));
return Call;
}

void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
VPValue *StoredValue = isStore() ? getStoredValue() : nullptr;

@@ -9523,14 +9630,35 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
return PartPtr;
};

auto MaskValue = [&](unsigned Part) -> Value * {
if (isMaskRequired)
return BlockInMaskParts[Part];
return nullptr;
};

// Handle Stores:
if (SI) {
State.setDebugLocFrom(SI->getDebugLoc());

for (unsigned Part = 0; Part < State.UF; ++Part) {
Instruction *NewSI = nullptr;
Value *StoredVal = State.get(StoredValue, Part);
if (CreateGatherScatter) {
if (State.EVL) {
Value *EVLPart = State.get(State.EVL, Part);
// If EVL is not nullptr, then EVL must be a valid value set during plan
// creation, possibly default value = whole vector register length. EVL
// is created only if TTI prefers predicated vectorization, thus if EVL
// is not nullptr it also implies preference for predicated
// vectorization.
// FIXME: Support reverse store after vp_reverse is added.
NewSI = lowerStoreUsingVectorIntrinsics(
Builder,
CreateGatherScatter
? State.get(getAddr(), Part)
: CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0))),
StoredVal, CreateGatherScatter, MaskValue(Part), EVLPart,
Alignment);
} else if (CreateGatherScatter) {
Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
Value *VectorGep = State.get(getAddr(), Part);
NewSI = Builder.CreateMaskedScatter(StoredVal, VectorGep, Alignment,
@@ -9561,7 +9689,21 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
State.setDebugLocFrom(LI->getDebugLoc());
for (unsigned Part = 0; Part < State.UF; ++Part) {
Value *NewLI;
if (CreateGatherScatter) {
if (State.EVL) {
Value *EVLPart = State.get(State.EVL, Part);
// If EVL is not nullptr, then EVL must be a valid value set during plan
// creation, possibly default value = whole vector register length. EVL
// is created only if TTI prefers predicated vectorization, thus if EVL
// is not nullptr it also implies preference for predicated
// vectorization.
// FIXME: Support reverse loading after vp_reverse is added.
NewLI = lowerLoadUsingVectorIntrinsics(
Builder, DataTy,
CreateGatherScatter
? State.get(getAddr(), Part)
: CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0))),
CreateGatherScatter, MaskValue(Part), EVLPart, Alignment);
} else if (CreateGatherScatter) {
Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
Value *VectorGep = State.get(getAddr(), Part);
NewLI = Builder.CreateMaskedGather(DataTy, VectorGep, Alignment, MaskPart,
43 changes: 43 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlan.h
@@ -242,6 +242,12 @@ struct VPTransformState {
ElementCount VF;
unsigned UF;

/// If EVL is not nullptr, then EVL must be a valid value set during plan
/// creation, possibly a default value = whole vector register length. EVL is
/// created only if TTI prefers predicated vectorization, thus if EVL is
/// not nullptr it also implies preference for predicated vectorization.
VPValue *EVL = nullptr;

/// Hold the indices to generate specific scalar instructions. Null indicates
/// that all instances are to be generated, using either scalar or vector
/// instructions.
@@ -1057,6 +1063,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
SLPLoad,
SLPStore,
ActiveLaneMask,
ExplicitVectorLength,
ExplicitVectorLengthIVIncrement,
CalculateTripCountMinusVF,
// Increment the canonical IV separately for each unrolled part.
CanonicalIVIncrementForPart,
@@ -1165,6 +1173,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
default:
return false;
case VPInstruction::ActiveLaneMask:
case VPInstruction::ExplicitVectorLength:
case VPInstruction::ExplicitVectorLengthIVIncrement:
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::BranchOnCount:
@@ -2180,6 +2190,39 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
#endif
};

/// A recipe for generating the phi node for the current index of elements,
/// adjusted in accordance with EVL value. It starts at StartIV value and gets
/// incremented by EVL in each iteration of the vector loop.
class VPEVLBasedIVPHIRecipe : public VPHeaderPHIRecipe {
public:
VPEVLBasedIVPHIRecipe(VPValue *StartMask, DebugLoc DL)
: VPHeaderPHIRecipe(VPDef::VPEVLBasedIVPHISC, nullptr, StartMask, DL) {}

~VPEVLBasedIVPHIRecipe() override = default;

VP_CLASSOF_IMPL(VPDef::VPEVLBasedIVPHISC)

static inline bool classof(const VPHeaderPHIRecipe *D) {
return D->getVPDefID() == VPDef::VPEVLBasedIVPHISC;
}

/// Generate phi for handling IV based on EVL over iterations correctly.
void execute(VPTransformState &State) override;

/// Returns true if the recipe only uses the first lane of operand \p Op.
bool onlyFirstLaneUsed(const VPValue *Op) const override {
assert(is_contained(operands(), Op) &&
"Op must be an operand of the recipe");
return true;
}

#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
/// Print the recipe.
void print(raw_ostream &O, const Twine &Indent,
VPSlotTracker &SlotTracker) const override;
#endif
};

/// A Recipe for widening the canonical induction variable of the vector loop.
class VPWidenCanonicalIVRecipe : public VPRecipeBase, public VPValue {
public:
16 changes: 8 additions & 8 deletions llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -207,14 +207,14 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
Type *ResultTy =
TypeSwitch<const VPRecipeBase *, Type *>(V->getDefiningRecipe())
.Case<VPCanonicalIVPHIRecipe, VPFirstOrderRecurrencePHIRecipe,
VPReductionPHIRecipe, VPWidenPointerInductionRecipe>(
[this](const auto *R) {
// Handle header phi recipes, except VPWienIntOrFpInduction
// which needs special handling due it being possibly truncated.
// TODO: consider inferring/caching type of siblings, e.g.,
// backedge value, here and in cases below.
return inferScalarType(R->getStartValue());
})
VPReductionPHIRecipe, VPWidenPointerInductionRecipe,
VPEVLBasedIVPHIRecipe>([this](const auto *R) {
// Handle header phi recipes, except VPWienIntOrFpInduction
// which needs special handling due it being possibly truncated.
// TODO: consider inferring/caching type of siblings, e.g.,
// backedge value, here and in cases below.
return inferScalarType(R->getStartValue());
})
.Case<VPWidenIntOrFpInductionRecipe, VPDerivedIVRecipe>(
[](const auto *R) { return R->getScalarType(); })
.Case<VPPredInstPHIRecipe, VPWidenPHIRecipe, VPScalarIVStepsRecipe,