Skip to content

[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering #124089

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 23, 2025

Conversation

ergawy
Copy link
Member

@ergawy ergawy commented Jan 23, 2025

Fixes #123800

Extends LDS lowering by allowing it to discover transitive indirect/escpaing references to LDS GVs.

For example, given the following input:

@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8

%store_type = type { i32, ptr }
@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8

define amdgpu_kernel void @offloading_kernel() {
  store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
  call void @call_unknown()
  ret void
}

define void @call_unknown() {
  %1 = alloca ptr, align 8
  %2 = call i32 %1()
  ret void
}

define void @indirectly_load_lds() {
  call void @directly_load_lds()
  ret void
}

define void @directly_load_lds() {
  %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
  ret void
}

With the above input, prior to this patch, LDS lowering failed to lower the reference to @lds_item_to_indirectly_load because:

  1. it is indirectly called by a function whose address is taken in the kernel.
  2. we did not check if the kernel indirectly makes any calls to unknown functions (we only checked the direct calls).

@llvmbot
Copy link
Member

llvmbot commented Jan 23, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Kareem Ergawy (ergawy)

Changes

Fixes #123800

Extends LDS lowering by allowing it to discover transitive indirect/escpaing references to LDS GVs.

For example, given the following input:

@<!-- -->lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8

%store_type = type { i32, ptr }
@<!-- -->place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8

define amdgpu_kernel void @<!-- -->offloading_kernel() {
  store ptr @<!-- -->indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @<!-- -->place_to_store_indirect_caller, i32 0), align 8
  call void @<!-- -->call_unknown()
  ret void
}

define void @<!-- -->call_unknown() {
  %1 = alloca ptr, align 8
  %2 = call i32 %1()
  ret void
}

define void @<!-- -->indirectly_load_lds() {
  call void @<!-- -->directly_load_lds()
  ret void
}

define void @<!-- -->directly_load_lds() {
  %2 = load ptr, ptr addrspace(3) @<!-- -->lds_item_to_indirectly_load, align 8
  ret void
}

With the above input, prior to this patch, LDS lowering failed to lower the reference to @<!-- -->lds_item_to_indirectly_load because:

  1. it is indirectly called by a function whose address is taken in the kernel.
  2. we did not check if the kernel indirectly makes any calls to unknown functions (we only checked the direct calls).

Full diff: https://github.com/llvm/llvm-project/pull/124089.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp (+45-7)
  • (added) llvm/test/CodeGen/AMDGPU/lower-indirect-lds-references.ll (+44)
  • (modified) llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll (+1-1)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp b/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
index 0406ba9c68ccd3..fc223e23901bae 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
@@ -141,8 +141,8 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
   FunctionVariableMap DirectMapFunction;
   getUsesOfLDSByFunction(CG, M, DirectMapKernel, DirectMapFunction);
 
-  // Collect variables that are used by functions whose address has escaped
-  DenseSet<GlobalVariable *> VariablesReachableThroughFunctionPointer;
+  // Collect functions whose address has escaped
+  DenseSet<Function*> AddressTakenFuncs;
   for (Function &F : M.functions()) {
     if (!isKernelLDS(&F))
       if (F.hasAddressTaken(nullptr,
@@ -150,11 +150,16 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
                             /* IgnoreAssumeLikeCalls */ false,
                             /* IgnoreLLVMUsed */ true,
                             /* IgnoreArcAttachedCall */ false)) {
-        set_union(VariablesReachableThroughFunctionPointer,
-                  DirectMapFunction[&F]);
+          AddressTakenFuncs.insert(&F);
       }
   }
 
+  // Collect variables that are used by functions whose address has escaped
+  DenseSet<GlobalVariable *> VariablesReachableThroughFunctionPointer;
+  for (Function *F : AddressTakenFuncs) {
+    set_union(VariablesReachableThroughFunctionPointer, DirectMapFunction[F]);
+  }
+
   auto FunctionMakesUnknownCall = [&](const Function *F) -> bool {
     assert(!F->isDeclaration());
     for (const CallGraphNode::CallRecord &R : *CG[F]) {
@@ -206,6 +211,13 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
     }
   }
 
+  // Collect variables that are transitively used by functions whose address has
+  // escaped
+  for (Function *F : AddressTakenFuncs) {
+    set_union(VariablesReachableThroughFunctionPointer,
+              TransitiveMapFunction[F]);
+  }
+
   // DirectMapKernel lists which variables are used by the kernel
   // find the variables which are used through a function call
   FunctionVariableMap IndirectMapKernel;
@@ -218,11 +230,37 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
       Function *Ith = R.second->getFunction();
       if (Ith) {
         set_union(IndirectMapKernel[&Func], TransitiveMapFunction[Ith]);
-      } else {
-        set_union(IndirectMapKernel[&Func],
-                  VariablesReachableThroughFunctionPointer);
       }
     }
+
+    // Check if the kernel encounters unknows calls, wheher directly or
+    // indirectly.
+    bool SeesUnknownCalls = [&]() {
+      SmallVector<Function *> WorkList = {CG[&Func]->getFunction()};
+      SmallPtrSet<Function *, 8> Visited;
+
+      while (!WorkList.empty()) {
+        Function *F = WorkList.pop_back_val();
+
+        for (const CallGraphNode::CallRecord &CallRecord : *CG[F]) {
+          if (!CallRecord.second)
+            continue;
+
+          Function *Callee = CallRecord.second->getFunction();
+          if (!Callee)
+            return true;
+
+          if (Visited.insert(Callee).second)
+            WorkList.push_back(Callee);
+        }
+      }
+      return false;
+    }();
+
+    if (SeesUnknownCalls) {
+      set_union(IndirectMapKernel[&Func],
+                VariablesReachableThroughFunctionPointer);
+    }
   }
 
   // Verify that we fall into one of 2 cases:
diff --git a/llvm/test/CodeGen/AMDGPU/lower-indirect-lds-references.ll b/llvm/test/CodeGen/AMDGPU/lower-indirect-lds-references.ll
new file mode 100644
index 00000000000000..f14f8d071af564
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/lower-indirect-lds-references.ll
@@ -0,0 +1,44 @@
+; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s
+
+; Tests that the LDS lowering pass handles indirect references to LDS GVs; i.e.
+; that it lowers to accesses into the generated LDS struct if these references
+; are deep in the call graph starting at the kernel.
+
+@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8
+
+%store_type = type { i32, ptr }
+@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8
+
+define amdgpu_kernel void @offloading_kernel() {
+  store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
+  call void @call_unknown()
+  ret void
+}
+
+define void @call_unknown() {
+  %1 = alloca ptr, align 8
+  %2 = call i32 %1()
+  ret void
+}
+
+define void @indirectly_load_lds() {
+  call void @directly_load_lds()
+  ret void
+}
+
+define void @directly_load_lds() {
+  %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
+  ret void
+}
+
+; CHECK: %[[LDS_STRUCT_TY:.*]] = type { %store_type, ptr }
+; CHECK: @[[LDS_STRUCT:.*]] = {{.*}} %[[LDS_STRUCT_TY]] {{.*}} !absolute_symbol
+
+; CHECK: define amdgpu_kernel void @offloading_kernel() {{.*}} {
+; CHECK:   store ptr @indirectly_load_lds, {{.*}} @[[LDS_STRUCT]]
+; CHECK:   call void @call_unknown()
+; CHECK: }
+
+; CHECK: define void @directly_load_lds() {
+; CHECK:   load ptr, {{.*}} (%[[LDS_STRUCT_TY]], {{.*}} @[[LDS_STRUCT]], i32 0, i32 1)
+; CHECK: }
diff --git a/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll b/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll
index 2850612d700817..1765bd1cfb0086 100644
--- a/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll
+++ b/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll
@@ -196,7 +196,7 @@ define amdgpu_kernel void @kernel_lds_recursion() {
 ; CHECK: attributes #[[ATTR2]] = { "amdgpu-lds-size"="2" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR3]] = { "amdgpu-lds-size"="4" "amdgpu-waves-per-eu"="4,10" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR4]] = { "amdgpu-lds-size"="2" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR5]] = { "amdgpu-lds-size"="2" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="4,10" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR5]] = { "amdgpu-lds-size"="4" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="4,10" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR6:[0-9]+]] = { nocallback nofree nosync nounwind willreturn memory(none) }
 ; CHECK: attributes #[[ATTR7:[0-9]+]] = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
 ;.

Copy link

github-actions bot commented Jan 23, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Copy link

github-actions bot commented Jan 23, 2025

✅ With the latest revision this PR passed the undef deprecator.

@ergawy ergawy force-pushed the handle_indirect_lds_references branch from ae2e16b to 3087adb Compare January 23, 2025 09:16
Fixes llvm#123800

Extends LDS lowering by allowing it to discover transitive
indirect/escpaing references to LDS GVs.

For example, given the following input:
```llvm
@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8

%store_type = type { i32, ptr }
@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8

define amdgpu_kernel void @offloading_kernel() {
  store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
  call void @call_unknown()
  ret void
}

define void @call_unknown() {
  %1 = alloca ptr, align 8
  %2 = call i32 %1()
  ret void
}

define void @indirectly_load_lds() {
  call void @directly_load_lds()
  ret void
}

define void @directly_load_lds() {
  %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
  ret void
}

```

With the above input, prior to this patch, LDS lowering failed to lower
the reference to `@lds_item_to_indirectly_load` because:
1. it is indirectly called by a function whose address is taken in the
   kernel.
2. we did not check if the kernel indirectly makes any calls to unknown
   functions (we only checked the direct calls).

Co-authored-by: Jon Chesterfield <[email protected]>
@ergawy ergawy force-pushed the handle_indirect_lds_references branch from 3087adb to 70b1fed Compare January 23, 2025 09:22
Copy link
Collaborator

@JonChesterfield JonChesterfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Algorithm is as we discussed, implementation is completely clear, covers the problem described in the issue. Happy to accept!

I'm nervous about the extra callgraph walk but I think the cunning hacks to avoid it are now DoA given the transitive indirect calls. Something is going to have to crawl the whole graph after all. I should be able to follow this with a NFC that removes some of the redundant pointer chasing.

@ronlieb ronlieb self-requested a review January 23, 2025 11:52
Copy link
Contributor

@mjklemm mjklemm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ergawy ergawy merged commit ff55c9b into llvm:main Jan 23, 2025
8 checks passed
@shiltian
Copy link
Contributor

This PR caused a pretty large performance regression for rocRAND (SWDEV-513849). Please consider reverting it if it can't be fixed in a reasonable period of time.

@ergawy
Copy link
Member Author

ergawy commented Feb 17, 2025

I can take a look tomorrow unless @JonChesterfield wants to do that himself (since he has substantially more experience with this part of the code than me).

@JonChesterfield
Copy link
Collaborator

This doesn't sound right.

This patch fixes miscompilation of programs with complicated (indirect function themed) call graphs, which were not really expected to exist outside of O0. rocRAND is supposed to be performance sensitive so really shouldn't be full of indirect calls.

It might mean we've regressed codegen on things that were previously compiled correctly in which case we've made a mistake here, it should have been a no-op. Hopefully the internal bug report has some IR associated, it would be useful to see which case we have here.

No to reverting though, this fixed a straightforward miscompilation that was blocking flang O0. We either accept that rocRAND should stop using indirect calls or work out where we've accidentally regressed.

@JonChesterfield
Copy link
Collaborator

JonChesterfield commented Feb 17, 2025

  1. Downloaded the docker containing the compiler that claims a performance regression
  2. Downloaded rocRAND and checked out the specific commit
  3. Set save-temps in environment variable, built the benchmark
  4. Copied benchmark_rocrand_kernel-hip-amdgcn-amd-amdhsa-gfx942.bc out
  5. Ran that bitcode through amd-staging llc with -print-before=amdgpu-lower-module-lds, extracted the IR
  6. Passed that IR through the amd-staging pass with and without this patch, no difference
  7. Ran the same bitcode through the staging llc, extracted different IR
  8. Passed that IR through the amd-staging pass with and without this patch, no difference

I claim that the regression is elsewhere.

@shiltian
Copy link
Contributor

I must have made a mistake. Sorry for the noise.

@JonChesterfield
Copy link
Collaborator

Not at all, thank you for teaching me how our docker infra works! Happy hunting

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in LDS lowering, F.hasAddressTaken() under-estimates escape analysis
5 participants