[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering #124089
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Kareem Ergawy (ergawy)

Changes

Fixes #123800

Extends LDS lowering by allowing it to discover transitive indirect/escaping references to LDS GVs. For example, given the following input:

```llvm
@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8

%store_type = type { i32, ptr }
@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8

define amdgpu_kernel void @offloading_kernel() {
  store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
  call void @call_unknown()
  ret void
}

define void @call_unknown() {
  %1 = alloca ptr, align 8
  %2 = call i32 %1()
  ret void
}

define void @indirectly_load_lds() {
  call void @directly_load_lds()
  ret void
}

define void @directly_load_lds() {
  %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
  ret void
}
```

With the above input, prior to this patch, LDS lowering failed to lower the reference to `@lds_item_to_indirectly_load` because:

1. it is used by a function that is only reached indirectly, through a function whose address is taken in the kernel, and
2. we did not check whether the kernel indirectly makes any calls to unknown functions (we only checked its direct calls).
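To make the fix concrete, here is a minimal standalone C++ sketch of the reachability logic described above. The types and names (`Func`, `ldsUsedByKernel`, etc.) are simplified placeholders for illustration only, not the actual LLVM implementation, which uses `CallGraph`, `DenseSet`, and friends in `AMDGPUMemoryUtils.cpp` as shown in the diff below:

```cpp
#include <set>
#include <string>
#include <vector>

struct Func {
  std::string Name;
  bool IsKernel = false;
  bool AddressTaken = false;
  std::set<std::string> DirectLDSUses; // LDS globals this function uses directly.
  std::vector<Func *> Callees;         // nullptr models an unknown (indirect) callee.
};

// Transitive LDS uses of F: its own direct uses plus those of everything it calls.
static void collectTransitiveUses(Func *F, std::set<std::string> &Out,
                                  std::set<Func *> &Visited) {
  if (!F || !Visited.insert(F).second)
    return;
  Out.insert(F->DirectLDSUses.begin(), F->DirectLDSUses.end());
  for (Func *C : F->Callees)
    collectTransitiveUses(C, Out, Visited);
}

// Does the kernel reach an unknown (indirect) call anywhere in its call graph?
static bool seesUnknownCalls(Func *Kernel) {
  std::vector<Func *> WorkList = {Kernel};
  std::set<Func *> Visited;
  while (!WorkList.empty()) {
    Func *F = WorkList.back();
    WorkList.pop_back();
    for (Func *C : F->Callees) {
      if (!C)
        return true; // Unknown callee somewhere below the kernel.
      if (Visited.insert(C).second)
        WorkList.push_back(C);
    }
  }
  return false;
}

// LDS variables a kernel must account for: its own transitive uses, plus, if it
// (transitively) makes unknown calls, the transitive uses of every non-kernel
// function whose address has been taken.
std::set<std::string> ldsUsedByKernel(Func *Kernel,
                                      const std::vector<Func *> &AllFuncs) {
  std::set<std::string> Result;
  std::set<Func *> Visited;
  collectTransitiveUses(Kernel, Result, Visited);

  if (seesUnknownCalls(Kernel)) {
    for (Func *F : AllFuncs) {
      if (F->IsKernel || !F->AddressTaken)
        continue;
      std::set<Func *> V;
      collectTransitiveUses(F, Result, V); // Transitive, not just direct, uses.
    }
  }
  return Result;
}
```

Both fixes show up in the sketch: address-taken functions contribute their transitive (not just direct) LDS uses, and the unknown-call check walks the entire call graph rooted at the kernel rather than inspecting only its direct call sites.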
Full diff: https://github.com/llvm/llvm-project/pull/124089.diff

3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp b/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
index 0406ba9c68ccd3..fc223e23901bae 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
@@ -141,8 +141,8 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
FunctionVariableMap DirectMapFunction;
getUsesOfLDSByFunction(CG, M, DirectMapKernel, DirectMapFunction);
- // Collect variables that are used by functions whose address has escaped
- DenseSet<GlobalVariable *> VariablesReachableThroughFunctionPointer;
+ // Collect functions whose address has escaped
+ DenseSet<Function*> AddressTakenFuncs;
for (Function &F : M.functions()) {
if (!isKernelLDS(&F))
if (F.hasAddressTaken(nullptr,
@@ -150,11 +150,16 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
/* IgnoreAssumeLikeCalls */ false,
/* IgnoreLLVMUsed */ true,
/* IgnoreArcAttachedCall */ false)) {
- set_union(VariablesReachableThroughFunctionPointer,
- DirectMapFunction[&F]);
+ AddressTakenFuncs.insert(&F);
}
}
+ // Collect variables that are used by functions whose address has escaped
+ DenseSet<GlobalVariable *> VariablesReachableThroughFunctionPointer;
+ for (Function *F : AddressTakenFuncs) {
+ set_union(VariablesReachableThroughFunctionPointer, DirectMapFunction[F]);
+ }
+
auto FunctionMakesUnknownCall = [&](const Function *F) -> bool {
assert(!F->isDeclaration());
for (const CallGraphNode::CallRecord &R : *CG[F]) {
@@ -206,6 +211,13 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
}
}
+ // Collect variables that are transitively used by functions whose address has
+ // escaped
+ for (Function *F : AddressTakenFuncs) {
+ set_union(VariablesReachableThroughFunctionPointer,
+ TransitiveMapFunction[F]);
+ }
+
// DirectMapKernel lists which variables are used by the kernel
// find the variables which are used through a function call
FunctionVariableMap IndirectMapKernel;
@@ -218,11 +230,37 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
Function *Ith = R.second->getFunction();
if (Ith) {
set_union(IndirectMapKernel[&Func], TransitiveMapFunction[Ith]);
- } else {
- set_union(IndirectMapKernel[&Func],
- VariablesReachableThroughFunctionPointer);
}
}
+
+ // Check if the kernel encounters unknown calls, whether directly or
+ // indirectly.
+ bool SeesUnknownCalls = [&]() {
+ SmallVector<Function *> WorkList = {CG[&Func]->getFunction()};
+ SmallPtrSet<Function *, 8> Visited;
+
+ while (!WorkList.empty()) {
+ Function *F = WorkList.pop_back_val();
+
+ for (const CallGraphNode::CallRecord &CallRecord : *CG[F]) {
+ if (!CallRecord.second)
+ continue;
+
+ Function *Callee = CallRecord.second->getFunction();
+ if (!Callee)
+ return true;
+
+ if (Visited.insert(Callee).second)
+ WorkList.push_back(Callee);
+ }
+ }
+ return false;
+ }();
+
+ if (SeesUnknownCalls) {
+ set_union(IndirectMapKernel[&Func],
+ VariablesReachableThroughFunctionPointer);
+ }
}
// Verify that we fall into one of 2 cases:
diff --git a/llvm/test/CodeGen/AMDGPU/lower-indirect-lds-references.ll b/llvm/test/CodeGen/AMDGPU/lower-indirect-lds-references.ll
new file mode 100644
index 00000000000000..f14f8d071af564
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/lower-indirect-lds-references.ll
@@ -0,0 +1,44 @@
+; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s
+
+; Tests that the LDS lowering pass handles indirect references to LDS GVs; i.e.
+; that it lowers to accesses into the generated LDS struct if these references
+; are deep in the call graph starting at the kernel.
+
+@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8
+
+%store_type = type { i32, ptr }
+@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8
+
+define amdgpu_kernel void @offloading_kernel() {
+ store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
+ call void @call_unknown()
+ ret void
+}
+
+define void @call_unknown() {
+ %1 = alloca ptr, align 8
+ %2 = call i32 %1()
+ ret void
+}
+
+define void @indirectly_load_lds() {
+ call void @directly_load_lds()
+ ret void
+}
+
+define void @directly_load_lds() {
+ %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
+ ret void
+}
+
+; CHECK: %[[LDS_STRUCT_TY:.*]] = type { %store_type, ptr }
+; CHECK: @[[LDS_STRUCT:.*]] = {{.*}} %[[LDS_STRUCT_TY]] {{.*}} !absolute_symbol
+
+; CHECK: define amdgpu_kernel void @offloading_kernel() {{.*}} {
+; CHECK: store ptr @indirectly_load_lds, {{.*}} @[[LDS_STRUCT]]
+; CHECK: call void @call_unknown()
+; CHECK: }
+
+; CHECK: define void @directly_load_lds() {
+; CHECK: load ptr, {{.*}} (%[[LDS_STRUCT_TY]], {{.*}} @[[LDS_STRUCT]], i32 0, i32 1)
+; CHECK: }
diff --git a/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll b/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll
index 2850612d700817..1765bd1cfb0086 100644
--- a/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll
+++ b/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll
@@ -196,7 +196,7 @@ define amdgpu_kernel void @kernel_lds_recursion() {
; CHECK: attributes #[[ATTR2]] = { "amdgpu-lds-size"="2" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
; CHECK: attributes #[[ATTR3]] = { "amdgpu-lds-size"="4" "amdgpu-waves-per-eu"="4,10" "uniform-work-group-size"="false" }
; CHECK: attributes #[[ATTR4]] = { "amdgpu-lds-size"="2" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR5]] = { "amdgpu-lds-size"="2" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="4,10" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR5]] = { "amdgpu-lds-size"="4" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="4,10" "uniform-work-group-size"="false" }
; CHECK: attributes #[[ATTR6:[0-9]+]] = { nocallback nofree nosync nounwind willreturn memory(none) }
; CHECK: attributes #[[ATTR7:[0-9]+]] = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
;.
✅ With the latest revision this PR passed the C/C++ code formatter.

✅ With the latest revision this PR passed the undef deprecator.
Algorithm is as we discussed, implementation is completely clear, covers the problem described in the issue. Happy to accept!
I'm nervous about the extra callgraph walk but I think the cunning hacks to avoid it are now DoA given the transitive indirect calls. Something is going to have to crawl the whole graph after all. I should be able to follow this with a NFC that removes some of the redundant pointer chasing.
LGTM
This PR caused a pretty large performance regression for rocRAND (SWDEV-513849). Please consider reverting it if it can't be fixed in a reasonable period of time.

I can take a look tomorrow unless @JonChesterfield wants to do that himself (since he has substantially more experience with this part of the code than me).

This doesn't sound right. This patch fixes miscompilation of programs with complicated (indirect-function-themed) call graphs, which were not really expected to exist outside of O0. rocRAND is supposed to be performance sensitive, so it really shouldn't be full of indirect calls. It might mean we've regressed codegen on things that were previously compiled correctly, in which case we've made a mistake here; it should have been a no-op. Hopefully the internal bug report has some IR associated; it would be useful to see which case we have here. No to reverting, though: this fixed a straightforward miscompilation that was blocking flang O0. We either accept that rocRAND should stop using indirect calls or work out where we've accidentally regressed.

I claim that the regression is elsewhere.

I must have made a mistake. Sorry for the noise.

Not at all, thank you for teaching me how our docker infra works! Happy hunting.