[AArch64][CostModel] Improve cost estimate of scalarizing a vector division #118055
Conversation
@llvm/pr-subscribers-backend-systemz @llvm/pr-subscribers-backend-powerpc

Author: Sushant Gokhale (sushgokh)

Changes: In the backend, the last resort for estimating the cost of a vector division is to use its scalar cost. However, without knowledge of the division operands, that cost can be off in certain cases. For SLP, this patch passes the scalars to the backend so it can produce a better scalar cost estimate.

Patch is 93.12 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/118055.diff

29 Files Affected:
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 985ca1532e0149..50f1210e82f1e2 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1297,13 +1297,16 @@ class TargetTransformInfo {
/// provide even more information.
/// \p TLibInfo is used to search for platform specific vector library
/// functions for instructions that might be converted to calls (e.g. frem).
+ /// \p Scalars refers to individual scalars/instructions being used for
+ /// vectorization.
InstructionCost getArithmeticInstrCost(
unsigned Opcode, Type *Ty,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput,
TTI::OperandValueInfo Opd1Info = {TTI::OK_AnyValue, TTI::OP_None},
TTI::OperandValueInfo Opd2Info = {TTI::OK_AnyValue, TTI::OP_None},
ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr,
- const TargetLibraryInfo *TLibInfo = nullptr) const;
+ const TargetLibraryInfo *TLibInfo = nullptr,
+ ArrayRef<Value *> Scalars = {}) const;
/// Returns the cost estimation for alternating opcode pattern that can be
/// lowered to a single instruction on the target. In X86 this is for the
@@ -2099,7 +2102,8 @@ class TargetTransformInfo::Concept {
virtual InstructionCost getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
OperandValueInfo Opd1Info, OperandValueInfo Opd2Info,
- ArrayRef<const Value *> Args, const Instruction *CxtI = nullptr) = 0;
+ ArrayRef<const Value *> Args, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {}) = 0;
virtual InstructionCost getAltInstrCost(
VectorType *VecTy, unsigned Opcode0, unsigned Opcode1,
const SmallBitVector &OpcodeMask,
@@ -2780,10 +2784,10 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
InstructionCost getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
OperandValueInfo Opd1Info, OperandValueInfo Opd2Info,
- ArrayRef<const Value *> Args,
- const Instruction *CxtI = nullptr) override {
+ ArrayRef<const Value *> Args, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {}) override {
return Impl.getArithmeticInstrCost(Opcode, Ty, CostKind, Opd1Info, Opd2Info,
- Args, CxtI);
+ Args, CxtI, Scalars);
}
InstructionCost getAltInstrCost(VectorType *VecTy, unsigned Opcode0,
unsigned Opcode1,
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 38aba183f6a173..4aa255a909edbf 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -581,11 +581,13 @@ class TargetTransformInfoImplBase {
unsigned getMaxInterleaveFactor(ElementCount VF) const { return 1; }
- InstructionCost getArithmeticInstrCost(
- unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
- TTI::OperandValueInfo Opd1Info, TTI::OperandValueInfo Opd2Info,
- ArrayRef<const Value *> Args,
- const Instruction *CxtI = nullptr) const {
+ InstructionCost getArithmeticInstrCost(unsigned Opcode, Type *Ty,
+ TTI::TargetCostKind CostKind,
+ TTI::OperandValueInfo Opd1Info,
+ TTI::OperandValueInfo Opd2Info,
+ ArrayRef<const Value *> Args,
+ const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {}) const {
// Widenable conditions will eventually lower into constants, so some
// operations with them will be trivially optimized away.
auto IsWidenableCondition = [](const Value *V) {
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index 98cbb4886642bf..4a27470763db83 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -924,7 +924,8 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Opd1Info = {TTI::OK_AnyValue, TTI::OP_None},
TTI::OperandValueInfo Opd2Info = {TTI::OK_AnyValue, TTI::OP_None},
- ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr) {
+ ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {}) {
// Check if any of the operands are vector operands.
const TargetLoweringBase *TLI = getTLI();
int ISD = TLI->InstructionOpcodeToISD(Opcode);
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 1fb2b9836de0cc..3f8745b1e06ecc 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -927,7 +927,7 @@ InstructionCost TargetTransformInfo::getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
OperandValueInfo Op1Info, OperandValueInfo Op2Info,
ArrayRef<const Value *> Args, const Instruction *CxtI,
- const TargetLibraryInfo *TLibInfo) const {
+ const TargetLibraryInfo *TLibInfo, ArrayRef<Value *> Scalars) const {
// Use call cost for frem intructions that have platform specific vector math
// functions, as those will be replaced with calls later by SelectionDAG or
@@ -942,10 +942,8 @@ InstructionCost TargetTransformInfo::getArithmeticInstrCost(
return getCallInstrCost(nullptr, VecTy, {VecTy, VecTy}, CostKind);
}
- InstructionCost Cost =
- TTIImpl->getArithmeticInstrCost(Opcode, Ty, CostKind,
- Op1Info, Op2Info,
- Args, CxtI);
+ InstructionCost Cost = TTIImpl->getArithmeticInstrCost(
+ Opcode, Ty, CostKind, Op1Info, Op2Info, Args, CxtI, Scalars);
assert(Cost >= 0 && "TTI should not produce negative costs!");
return Cost;
}
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index d1536a276a9040..f9434b1691d746 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -3376,8 +3376,8 @@ InstructionCost AArch64TTIImpl::getScalarizationOverhead(
InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info, TTI::OperandValueInfo Op2Info,
- ArrayRef<const Value *> Args,
- const Instruction *CxtI) {
+ ArrayRef<const Value *> Args, const Instruction *CxtI,
+ ArrayRef<Value *> Scalars) {
// The code-generator is currently not able to handle scalable vectors
// of <vscale x 1 x eltty> yet, so return an invalid cost to avoid selecting
@@ -3442,8 +3442,8 @@ InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
if (!VT.isVector() && VT.getSizeInBits() > 64)
return getCallInstrCost(/*Function*/ nullptr, Ty, {Ty, Ty}, CostKind);
- InstructionCost Cost = BaseT::getArithmeticInstrCost(
- Opcode, Ty, CostKind, Op1Info, Op2Info);
+ InstructionCost Cost =
+ BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info, Op2Info);
if (Ty->isVectorTy()) {
if (TLI->isOperationLegalOrCustom(ISD, LT.second) && ST->hasSVE()) {
// SDIV/UDIV operations are lowered using SVE, then we can have less
@@ -3472,29 +3472,41 @@ InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
Cost *= 4;
return Cost;
} else {
- // If one of the operands is a uniform constant then the cost for each
- // element is Cost for insertion, extraction and division.
- // Insertion cost = 2, Extraction Cost = 2, Division = cost for the
- // operation with scalar type
- if ((Op1Info.isConstant() && Op1Info.isUniform()) ||
- (Op2Info.isConstant() && Op2Info.isUniform())) {
- if (auto *VTy = dyn_cast<FixedVectorType>(Ty)) {
+ if (auto *VTy = dyn_cast<FixedVectorType>(Ty)) {
+ if ((Op1Info.isConstant() && Op1Info.isUniform()) ||
+ (Op2Info.isConstant() && Op2Info.isUniform())) {
InstructionCost DivCost = BaseT::getArithmeticInstrCost(
Opcode, Ty->getScalarType(), CostKind, Op1Info, Op2Info);
- return (4 + DivCost) * VTy->getNumElements();
+ // If #vector_elements = n then we need
+ // n inserts + 2n extracts + n divisions.
+ InstructionCost InsertExtractCost =
+ ST->getVectorInsertExtractBaseCost();
+ Cost = (3 * InsertExtractCost + DivCost) * VTy->getNumElements();
+ } else if (!Scalars.empty()) {
+ // If #vector_elements = n then we need
+ // n inserts + 2n extracts + n divisions.
+ InstructionCost InsertExtractCost =
+ ST->getVectorInsertExtractBaseCost();
+ Cost = (3 * InsertExtractCost) * VTy->getNumElements();
+ for (auto *V : Scalars) {
+ auto *I = cast<Instruction>(V);
+ Cost +=
+ getArithmeticInstrCost(I->getOpcode(), I->getType(), CostKind,
+ TTI::getOperandInfo(I->getOperand(0)),
+ TTI::getOperandInfo(I->getOperand(1)));
+ }
+ } else {
+ // FIXME: The initial cost calculated should have considered extract
+ // cost twice. For now, we just add additional cost to avoid
+ // underestimating the total cost.
+ Cost += Cost;
}
+ } else {
+ // We can't predict the cost of div/extract/insert without knowing the
+ // vector width.
+ Cost.setInvalid();
}
- // On AArch64, without SVE, vector divisions are expanded
- // into scalar divisions of each pair of elements.
- Cost += getArithmeticInstrCost(Instruction::ExtractElement, Ty,
- CostKind, Op1Info, Op2Info);
- Cost += getArithmeticInstrCost(Instruction::InsertElement, Ty, CostKind,
- Op1Info, Op2Info);
}
-
- // TODO: if one of the arguments is scalar, then it's not necessary to
- // double the cost of handling the vector elements.
- Cost += Cost;
}
return Cost;
}
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index 201bc831b816b3..8845efd241f8f0 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -218,7 +218,8 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info = {TTI::OK_AnyValue, TTI::OP_None},
TTI::OperandValueInfo Op2Info = {TTI::OK_AnyValue, TTI::OP_None},
- ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr);
+ ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {});
InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
const SCEV *Ptr);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 5160851f8c4424..e1d718aa8a3509 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -546,8 +546,8 @@ bool GCNTTIImpl::getTgtMemIntrinsic(IntrinsicInst *Inst,
InstructionCost GCNTTIImpl::getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info, TTI::OperandValueInfo Op2Info,
- ArrayRef<const Value *> Args,
- const Instruction *CxtI) {
+ ArrayRef<const Value *> Args, const Instruction *CxtI,
+ ArrayRef<Value *> Scalars) {
// Legalize the type.
std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Ty);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
index 10956861650ab3..414e4b8f7284a7 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
@@ -157,7 +157,8 @@ class GCNTTIImpl final : public BasicTTIImplBase<GCNTTIImpl> {
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info = {TTI::OK_AnyValue, TTI::OP_None},
TTI::OperandValueInfo Op2Info = {TTI::OK_AnyValue, TTI::OP_None},
- ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr);
+ ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {});
InstructionCost getCFInstrCost(unsigned Opcode, TTI::TargetCostKind CostKind,
const Instruction *I = nullptr);
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
index 0e29648a7a284f..9bf37d8457b02f 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@@ -10,6 +10,7 @@
#include "ARMSubtarget.h"
#include "MCTargetDesc/ARMAddressingModes.h"
#include "llvm/ADT/APInt.h"
+#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/CodeGen/CostTable.h"
@@ -1349,8 +1350,8 @@ InstructionCost ARMTTIImpl::getShuffleCost(TTI::ShuffleKind Kind,
InstructionCost ARMTTIImpl::getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info, TTI::OperandValueInfo Op2Info,
- ArrayRef<const Value *> Args,
- const Instruction *CxtI) {
+ ArrayRef<const Value *> Args, const Instruction *CxtI,
+ ArrayRef<Value *> Scalars) {
int ISDOpcode = TLI->InstructionOpcodeToISD(Opcode);
if (ST->isThumb() && CostKind == TTI::TCK_CodeSize && Ty->isIntegerTy(1)) {
// Make operations on i1 relatively expensive as this often involves
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
index 3a4f940088b2e3..bef278e211a418 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
@@ -258,7 +258,8 @@ class ARMTTIImpl : public BasicTTIImplBase<ARMTTIImpl> {
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info = {TTI::OK_AnyValue, TTI::OP_None},
TTI::OperandValueInfo Op2Info = {TTI::OK_AnyValue, TTI::OP_None},
- ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr);
+ ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {});
InstructionCost
getMemoryOpCost(unsigned Opcode, Type *Src, MaybeAlign Alignment,
diff --git a/llvm/lib/Target/BPF/BPFTargetTransformInfo.h b/llvm/lib/Target/BPF/BPFTargetTransformInfo.h
index bf0bef3a2b2f98..ca52bd375021c2 100644
--- a/llvm/lib/Target/BPF/BPFTargetTransformInfo.h
+++ b/llvm/lib/Target/BPF/BPFTargetTransformInfo.h
@@ -16,6 +16,7 @@
#define LLVM_LIB_TARGET_BPF_BPFTARGETTRANSFORMINFO_H
#include "BPFTargetMachine.h"
+#include "llvm/ADT/ArrayRef.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/CodeGen/BasicTTIImpl.h"
#include "llvm/Transforms/Utils/ScalarEvolutionExpander.h"
@@ -61,7 +62,8 @@ class BPFTTIImpl : public BasicTTIImplBase<BPFTTIImpl> {
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info = {TTI::OK_AnyValue, TTI::OP_None},
TTI::OperandValueInfo Op2Info = {TTI::OK_AnyValue, TTI::OP_None},
- ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr) {
+ ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {}) {
int ISD = TLI->InstructionOpcodeToISD(Opcode);
if (ISD == ISD::ADD && CostKind == TTI::TCK_RecipThroughput)
return SCEVCheapExpansionBudget.getValue() + 1;
diff --git a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp
index bbb9d065b62435..53572321ecec62 100644
--- a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp
+++ b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp
@@ -273,8 +273,8 @@ InstructionCost HexagonTTIImpl::getCmpSelInstrCost(
InstructionCost HexagonTTIImpl::getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info, TTI::OperandValueInfo Op2Info,
- ArrayRef<const Value *> Args,
- const Instruction *CxtI) {
+ ArrayRef<const Value *> Args, const Instruction *CxtI,
+ ArrayRef<Value *> Scalars) {
// TODO: Handle more cost kinds.
if (CostKind != TTI::TCK_RecipThroughput)
return BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info,
diff --git a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h
index 826644d08d1ac0..abaa25136574ae 100644
--- a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h
+++ b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h
@@ -142,7 +142,8 @@ class HexagonTTIImpl : public BasicTTIImplBase<HexagonTTIImpl> {
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info = {TTI::OK_AnyValue, TTI::OP_None},
TTI::OperandValueInfo Op2Info = {TTI::OK_AnyValue, TTI::OP_None},
- ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr);
+ ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {});
InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
TTI::CastContextHint CCH,
TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/Lanai/LanaiTargetTransformInfo.h b/llvm/lib/Target/Lanai/LanaiTargetTransformInfo.h
index 5fe63e4a2e0312..5000a93a818b0a 100644
--- a/llvm/lib/Target/Lanai/LanaiTargetTransformInfo.h
+++ b/llvm/lib/Target/Lanai/LanaiTargetTransformInfo.h
@@ -94,7 +94,8 @@ class LanaiTTIImpl : public BasicTTIImplBase<LanaiTTIImpl> {
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info = {TTI::OK_AnyValue, TTI::OP_None},
TTI::OperandValueInfo Op2Info = {TTI::OK_AnyValue, TTI::OP_None},
- ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr) {
+ ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {}) {
int ISD = TLI->InstructionOpcodeToISD(Opcode);
switch (ISD) {
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
index 4ec2ec100ab08d..2c194d9fff1ad4 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
@@ -485,8 +485,8 @@ NVPTXTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
InstructionCost NVPTXTTIImpl::getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info, TTI::OperandValueInfo Op2Info,
- ArrayRef<const Value *> Args,
- const Instruction *CxtI) {
+ ArrayRef<const Value *> Args, const Instruction *CxtI,
+ ArrayRef<Value *> Scalars) {
// Legalize the type.
std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Ty);
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index 0f4fb280b2d996..3d34f8d97ddc47 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -98,7 +98,8 @@ class NVPTXTTIImpl : public BasicTTIImplBase<NVPTXTTIImpl> {
unsigned Opcode, Type *Ty, TTI::TargetCostKi...
[truncated]
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/CodeGen/BasicTTIImpl.h"
#include "llvm/Transforms/Utils/ScalarEvolutionExpander.h"
@@ -61,7 +62,8 @@ class BPFTTIImpl : public BasicTTIImplBase<BPFTTIImpl> {
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info = {TTI::OK_AnyValue, TTI::OP_None},
TTI::OperandValueInfo Op2Info = {TTI::OK_AnyValue, TTI::OP_None},
- ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr) {
+ ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {}) {
int ISD = TLI->InstructionOpcodeToISD(Opcode);
if (ISD == ISD::ADD && CostKind == TTI::TCK_RecipThroughput)
return SCEVCheapExpansionBudget.getValue() + 1;
diff --git a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp
index bbb9d065b62435..53572321ecec62 100644
--- a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp
+++ b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp
@@ -273,8 +273,8 @@ InstructionCost HexagonTTIImpl::getCmpSelInstrCost(
InstructionCost HexagonTTIImpl::getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info, TTI::OperandValueInfo Op2Info,
- ArrayRef<const Value *> Args,
- const Instruction *CxtI) {
+ ArrayRef<const Value *> Args, const Instruction *CxtI,
+ ArrayRef<Value *> Scalars) {
// TODO: Handle more cost kinds.
if (CostKind != TTI::TCK_RecipThroughput)
return BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info,
diff --git a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h
index 826644d08d1ac0..abaa25136574ae 100644
--- a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h
+++ b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h
@@ -142,7 +142,8 @@ class HexagonTTIImpl : public BasicTTIImplBase<HexagonTTIImpl> {
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info = {TTI::OK_AnyValue, TTI::OP_None},
TTI::OperandValueInfo Op2Info = {TTI::OK_AnyValue, TTI::OP_None},
- ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr);
+ ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {});
InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
TTI::CastContextHint CCH,
TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/Lanai/LanaiTargetTransformInfo.h b/llvm/lib/Target/Lanai/LanaiTargetTransformInfo.h
index 5fe63e4a2e0312..5000a93a818b0a 100644
--- a/llvm/lib/Target/Lanai/LanaiTargetTransformInfo.h
+++ b/llvm/lib/Target/Lanai/LanaiTargetTransformInfo.h
@@ -94,7 +94,8 @@ class LanaiTTIImpl : public BasicTTIImplBase<LanaiTTIImpl> {
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info = {TTI::OK_AnyValue, TTI::OP_None},
TTI::OperandValueInfo Op2Info = {TTI::OK_AnyValue, TTI::OP_None},
- ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr) {
+ ArrayRef<const Value *> Args = {}, const Instruction *CxtI = nullptr,
+ ArrayRef<Value *> Scalars = {}) {
int ISD = TLI->InstructionOpcodeToISD(Opcode);
switch (ISD) {
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
index 4ec2ec100ab08d..2c194d9fff1ad4 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
@@ -485,8 +485,8 @@ NVPTXTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
InstructionCost NVPTXTTIImpl::getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info, TTI::OperandValueInfo Op2Info,
- ArrayRef<const Value *> Args,
- const Instruction *CxtI) {
+ ArrayRef<const Value *> Args, const Instruction *CxtI,
+ ArrayRef<Value *> Scalars) {
// Legalize the type.
std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Ty);
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index 0f4fb280b2d996..3d34f8d97ddc47 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -98,7 +98,8 @@ class NVPTXTTIImpl : public BasicTTIImplBase<NVPTXTTIImpl> {
unsigned Opcode, Type *Ty, TTI::TargetCostKi...
[truncated]
@@ -7,21 +7,21 @@ define i32 @sdiv() {
; CHECK-LABEL: 'sdiv'
; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %I128 = sdiv i128 undef, undef
; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I64 = sdiv i64 undef, undef
; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i64 = sdiv <2 x i64> undef, undef
It feels like this patch is doing two things at once:
- Changing the costs of divides when not passing in the new Scalars parameter,
- Changing the costs of divides when passing in a non-null Scalars.
It's probably worth splitting the changes up into two different PRs so we can review them independently. It also allows us to monitor the performance impact of each change on its own.
The changes in llvm/test/Analysis/CostModel/AArch64/div.ll are not completely orthogonal to my patch (or (1) is not completely orthogonal to (2)).
This piece of code needs changes for the following reasons:
- Now that I have extra information about the scalars, information about the scalar type can be passed, i.e. instead of passing Op1Info and Op2Info, I need to pass information about the scalars. I forgot to update this. Will update the patch.
- The extract cost, which contributes to the entire cost, is considered once instead of twice here, which I think is not correct.
In short, updating this small piece of code would cause changes in llvm/test/Analysis/CostModel/AArch64/div.ll.
The perf results, as mentioned in the description, are neutral.
@david-arm I segregated the two things as you mentioned.
InstructionCost InsertExtractCost =
    ST->getVectorInsertExtractBaseCost();
Cost = (3 * InsertExtractCost + DivCost) * VTy->getNumElements();
} else if (!Scalars.empty()) {
Why is Args not enough for this analysis?
I initially tried passing the scalars through Args, but it crashes one of the tests. Didn't debug this yet.
Need to investigate this rather than adding some extra args
done
Force-pushed from 37a5288 to de74fc6.
ping
CommonCost;
return CommonCost +
       TTI->getArithmeticInstrCost(ShuffleOrOp, VecTy, CostKind, Op1Info,
                                   Op2Info, E->Scalars, nullptr, TLI);
This is incorrect, it should pass operands, not the original scalars
But how does it matter if we pass the scalars instead of their operands here?
It breaks the API contract.
resolved
Force-pushed from de74fc6 to 6596bc9.
  return I->getType()->getPrimitiveSizeInBits() >= 32;
};
if (all_of(E->Scalars, [ShuffleOrOp, IsAllowedScalarTy](Value *V) {
      return !IsaPred<UndefValue, PoisonValue>(V) &&
return !isa<UndefValue, PoisonValue>(V) &&
auto IsAllowedScalarTy = [](const Value *I) {
  // Just to avoid regression with RISCV unittest.
  return I->getType()->getPrimitiveSizeInBits() >= 32;
};
Hacks are a bad way to go.
These Args are being used by the RISCV target for calculating the cast cost. To avoid miscalculation with the patch, I had to use this hack.
If we don't want this, we need to introduce another parameter. There's no way out here.
Or adjust RISCVTargetTransformInfo.cpp to handle it correctly, if there is something wrong, or update RISCV tests, if everything is fine
Adjusting RISCVTargetTransformInfo.cpp or updating the test and justifying the results is beyond my scope. Without the patch, casting is done only for the scalar; with the patch, it's being done for all operands of all the scalars.
Adjusting RISCVTargetTransformInfo.cpp or updating the test and justifying the results is beyond my scope. Without the patch, casting is done only for the scalar; with the patch, it's being done for all operands of all the scalars.
In upstream, you should care about all targets; you cannot introduce hacks just for your target. You can do this in your downstream version.
That's the reason I wanted to introduce one more API parameter.
I have tried to debug the RISCV backend for the failing test SLPVectorizer/RISCV/strided-loads-vectorized.ll, and looking at the comment
// f16 with zvfhmin and bf16 will be promoted to f32.
calculating the cast cost for the args seems a valid choice, which isn't happening currently.
But if this is incorrect, could you please help me avoid this regression?
Force-pushed from 6596bc9 to 4f6d01d.
Do not use force push.
// If the information about individual scalars being vectorized is
// available, this yields better cost estimation.
if (auto *VTy = dyn_cast<FixedVectorType>(Ty); VTy && !Args.empty()) {
  InstructionCost InsertExtractCost =
      ST->getVectorInsertExtractBaseCost();
  Cost = (3 * InsertExtractCost) * VTy->getNumElements();
  for (int i = 0, Sz = Args.size(); i < Sz; i += 2) {
    Cost += getArithmeticInstrCost(
        Opcode, VTy->getScalarType(), CostKind,
        TTI::getOperandInfo(Args[i]), TTI::getOperandInfo(Args[i + 1]));
  }
  return Cost;
}
I don't understand what you are trying to do here. Why are you adding the cost of the scalar ops here? They are already counted in the cost model.
I don't understand what you are trying to do here. Why are you adding the cost of the scalar ops here? They are already counted in the cost model.
AArch64 NEON does not support vector division natively. It scalarizes the vector division, as can be seen here.
Now coming to the cost modelling part: this code section is calculating the cost of scalarizing the vector division. Thus,
vector cost =
  for each lane (cost of extracting both div operands) +
  for each lane (cost of the scalar division) +
  for each lane (cost of inserting into the result vector)
This vector cost was being over-calculated, thus preferring scalar code. This patch just tries to rectify that.
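The per-lane accounting above can be sketched as a standalone toy in C++. This is illustrative only: `InsertExtractCost` stands in for `ST->getVectorInsertExtractBaseCost()`, and the per-lane division costs stand in for scalar `getArithmeticInstrCost()` queries; none of the LLVM API is used here.

```cpp
#include <cassert>
#include <vector>

// Toy model of the scalarization cost described above: each lane pays two
// operand extracts and one insert into the result vector (three
// insert/extract operations total), plus the cost of one scalar division,
// which may differ per lane (e.g. cheaper when that lane's divisor is a
// known constant).
int scalarizedDivCost(int InsertExtractCost,
                      const std::vector<int> &PerLaneDivCost) {
  int NumElements = static_cast<int>(PerLaneDivCost.size());
  int Cost = 3 * InsertExtractCost * NumElements; // 2 extracts + 1 insert
  for (int DivCost : PerLaneDivCost)
    Cost += DivCost; // scalar division cost for this lane
  return Cost;
}
```

Separating the fixed insert/extract overhead from the per-lane division cost is what lets extra knowledge about individual lanes lower the total.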
Then, you should not build the vector node at all. I would suggest instead that you check how the getScalarsVectorizationState function in SLP works, add a check for NEON division (maybe add a new entry in TTI to check whether the vector operation is legal and won't be scalarized), and, for NEON, return that it should build the TreeEntry::NeedToGather node.
Then, you should not build the vector node at all. I would suggest instead that you check how the getScalarsVectorizationState function in SLP works, add a check for NEON division (maybe add a new entry in TTI to check whether the vector operation is legal and won't be scalarized), and, for NEON, return that it should build the TreeEntry::NeedToGather node.
Two issues which I can think of with this approach:
- The operands of the div operation won't get vectorized.
- A lot of checks are in place that alter the div cost based on information about its operands (e.g. whether the operands are uniform/constant/power-of-2, etc.). Hence, factoring out code just to check whether the div operation would be scalarized is difficult and would result in duplication of code.
- That's fine. If you think they should be vectorized, a new vectorization strategy should be implemented instead.
- No, it is just a simple legality check. If the instruction should be scalarized, it should be built into a build vector node; it should not be vectorized.
- That's fine. If you think they should be vectorized, a new vectorization strategy should be implemented instead.
Don't you think lowering the division vector in some late IR pass is easier than stopping vectorization at the div? One more advantage of lowering later is that the lowering is transparent to the user and done in a single place rather than in multiple places.
This type of vector lowering is already being done in other places, e.g. masked_load.
2. No, it is just a simple legality check. If the instruction should be scalarized, it should be built into a build vector node; it should not be vectorized.
Just need a clarification: my understanding of TreeEntry::NeedToGather is that you build a vector out of the scalars. Is that correct?
Now, coming to the legality checks: yes, in the end it's a legality check, but I would need to check all the constraints, which are currently in getArithmeticInstrCost, once again. And I am trying to avoid that route for the reasons mentioned in my comment on (1).
- That's fine. If you think they should be vectorized, a new vectorization strategy should be implemented instead.
Don't you think lowering the division vector in some late IR pass is easier than stopping vectorization at the div? One more advantage of lowering later is that the lowering is transparent to the user and done in a single place rather than in multiple places. This type of vector lowering is already being done in other places, e.g. masked_load.
It just breaks the vectorizer design. If the node should be scalarized, the vectorization model should know about it and scalarize it. Otherwise, it complicates the whole vectorization process and makes it miss some possible vectorization opportunities. I assume you missed the cost of extraction of the operands in your cost model changes?
- No, it is just a simple legality check. If the instruction should be scalarized, it should be built into a build vector node; it should not be vectorized.
Just need a clarification: my understanding of TreeEntry::NeedToGather is that you build a vector out of the scalars. Is that correct? Now, coming to the legality checks: yes, in the end it's a legality check, but I would need to check all the constraints, which are currently in getArithmeticInstrCost, once again. And I am trying to avoid that route for the reasons mentioned in my comment on (1).
You can check these constraints beforehand and command the vectorizer to build a NeedToGather node. You should not use getArithmeticInstrCost in this case; some other function should check whether it is legal to vectorize these scalars here.
Otherwise, it complicates the whole vectorization process and makes it miss some possible vectorization opportunities. I assume you missed the cost of extraction of the operands in your cost model changes?
No, in fact the patch is accurately calculating the cost of, for every lane, 2 extracts, 1 div, and 1 insert. See
Cost = (3 * InsertExtractCost) * VTy->getNumElements(); // Insert/extract cost is the same
So there is no way it is missing out on opportunities, and by no means is it complicating the process.
ping
I happened to be looking at urem costs recently - not exactly related to this, but it led to a similar area. I was only going to update urem by constant values, but the sdiv costs are similar. I'm not sure this needs to alter the SLP vectorizer for the example tests. So long as all the second operands are constant or power-of-two, it should be expanded to not use a div (and not scalarize). I think we can update this bit of code:
The urem patch I mentioned is now in #122236.
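As a hedged aside on the power-of-two case mentioned above: for unsigned values, division by a power-of-two constant reduces to a right shift, which is why such vector divides can be lowered to a vector shift instead of being scalarized. This sketch is illustrative only, not the actual AArch64 lowering:

```cpp
#include <cassert>
#include <cstdint>

// For an unsigned divisor D == 1u << K, X / D == X >> K, so a vector udiv
// whose lanes all divide by powers of two can become a single vector shift.
uint32_t udivPow2(uint32_t X, uint32_t D) {
  assert(D != 0 && (D & (D - 1)) == 0 && "divisor must be a power of two");
  unsigned K = 0;
  while ((1u << K) != D)
    ++K; // K = log2(D)
  return X >> K;
}
```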
Thanks for the patch. This serves as a direction for sdiv/srem, and I will take this up next. Although it resolves the motivating case, it does not resolve the generic case where Op2 is unknown. Consider the example below (the motivating example modified with sdiv->udiv and the divisors changed to unknowns).
The urem patch fails to vectorize this, while this patch vectorizes it. In summary, this patch is necessary as it refines the scalar cost. Last thing: I will update the motivating case as above.
…vision In the backend, the last resort for finding the vector division cost is to use its scalar cost. However, without knowledge of the division operands, the cost can be off in certain cases. For SLP, this patch tries to pass the scalars for better scalar cost estimation in the backend.
Force-pushed from 4f6d01d to 1bdc03a.
@@ -18,6 +18,7 @@
#include "llvm/CodeGen/BasicTTIImpl.h"
#include "llvm/CodeGen/CostTable.h"
#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/IR/Constants.h"
Will remove this in the next version. Got this accidentally.
ping
if (Args.size() == 2 &&
    !(isa<ConstantVector>(Args[1]) ||
      isa<ConstantDataVector>(Args[1]) ||
      isa<ConstantExpr>(Args[1])) &&
I assume you do not need to pass args for this, instead you can rely on Op1Info and Op2Info
For the test where only Op1 or only Op2 is constant (see the added tests), simply relying on Op1Info and Op2Info does not work. In such cases, we need to pass the args.
Why?
The cost of the division when the divisor is constant is less than when the divisor is unknown.
When we are considering the scalarization cost, we are considering the div cost of each lane plus the additional insert/extract cost. This is where this patch yields a lower cost, given extra knowledge about the values that go into those lanes.
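The point about per-lane knowledge can be illustrated with a toy model (the OpKind enum and the costs 1 and 4 are made-up numbers for illustration; this is not LLVM's OperandValueInfo machinery): a single vector-level classification degrades to "unknown" as soon as one lane's divisor is non-constant, while per-lane inspection can still price each lane precisely.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// A lane's divisor: a value means "constant divisor", std::nullopt means
// the divisor for that lane is unknown.
enum class OpKind { Constant, Unknown };

// Vector-level classification: one verdict for the whole operand, so a
// single non-constant lane makes the entire divisor "Unknown".
OpKind vectorLevelInfo(const std::vector<std::optional<int>> &Divisors) {
  for (const auto &D : Divisors)
    if (!D)
      return OpKind::Unknown;
  return OpKind::Constant;
}

// Per-lane costing: each lane is priced on its own, so constant lanes stay
// cheap even when other lanes are unknown (costs 1 and 4 are illustrative).
int perLaneDivCost(const std::vector<std::optional<int>> &Divisors) {
  int Cost = 0;
  for (const auto &D : Divisors)
    Cost += D ? 1 : 4;
  return Cost;
}
```

This mirrors a case like @sdiv_v2i32_Op1_unknown_Op2_const: the vector-level summary says "unknown", yet half the lanes are still cheap when examined individually.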
Op2Info provides info about constantness of the divisor
Also, I told this before already, if the node is scalarized, it must be represented as a buildvector node.
Please check the case @sdiv_v2i32_Op1_unknown_Op2_const, where only one of the divisors is constant. In that case, Op2Info would be something generic and doesn't give sufficient information. I hope this makes sense.
Also, I told this before already, if the node is scalarized, it must be represented as a buildvector node.
I have already tried to justify why this won't be a better option.
Scalarized nodes must be represented as buildvectors. If the operands should be vectorized, we need to add special processing for such operands. Vectorizing must-be-scalarized nodes goes against the design of the vectorizer.
…vision
In the backend, the last resort for finding the vector division cost is to use its scalar cost. However, without knowledge of the division operands, the cost can be off in certain cases.
For SLP, this patch tries to pass the scalars for better scalar cost estimation in the backend.
Tested the patch on Neoverse-V2 with SPEC17 (C/C++). No regressions observed.