[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering #124089

ergawy · 2025-01-23T09:11:46Z

Extends LDS lowering by allowing it to discover transitive indirect/escpaing references to LDS GVs.

For example, given the following input:

@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8

%store_type = type { i32, ptr }
@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8

define amdgpu_kernel void @offloading_kernel() {
  store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
  call void @call_unknown()
  ret void
}

define void @call_unknown() {
  %1 = alloca ptr, align 8
  %2 = call i32 %1()
  ret void
}

define void @indirectly_load_lds() {
  call void @directly_load_lds()
  ret void
}

define void @directly_load_lds() {
  %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
  ret void
}

With the above input, prior to this patch, LDS lowering failed to lower the reference to @lds_item_to_indirectly_load because:

it is indirectly called by a function whose address is taken in the kernel.
we did not check if the kernel indirectly makes any calls to unknown functions (we only checked the direct calls).

llvmbot · 2025-01-23T09:12:19Z

@llvm/pr-subscribers-backend-amdgpu

Author: Kareem Ergawy (ergawy)

Changes

Fixes #123800

Extends LDS lowering by allowing it to discover transitive indirect/escpaing references to LDS GVs.

For example, given the following input:

@<!-- -->lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8

%store_type = type { i32, ptr }
@<!-- -->place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8

define amdgpu_kernel void @<!-- -->offloading_kernel() {
  store ptr @<!-- -->indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @<!-- -->place_to_store_indirect_caller, i32 0), align 8
  call void @<!-- -->call_unknown()
  ret void
}

define void @<!-- -->call_unknown() {
  %1 = alloca ptr, align 8
  %2 = call i32 %1()
  ret void
}

define void @<!-- -->indirectly_load_lds() {
  call void @<!-- -->directly_load_lds()
  ret void
}

define void @<!-- -->directly_load_lds() {
  %2 = load ptr, ptr addrspace(3) @<!-- -->lds_item_to_indirectly_load, align 8
  ret void
}

With the above input, prior to this patch, LDS lowering failed to lower the reference to @lds_item_to_indirectly_load because:

it is indirectly called by a function whose address is taken in the kernel.
we did not check if the kernel indirectly makes any calls to unknown functions (we only checked the direct calls).

Full diff: https://github.com/llvm/llvm-project/pull/124089.diff

3 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp (+45-7)
(added) llvm/test/CodeGen/AMDGPU/lower-indirect-lds-references.ll (+44)
(modified) llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll (+1-1)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp b/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
index 0406ba9c68ccd3..fc223e23901bae 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
@@ -141,8 +141,8 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
   FunctionVariableMap DirectMapFunction;
   getUsesOfLDSByFunction(CG, M, DirectMapKernel, DirectMapFunction);
 
-  // Collect variables that are used by functions whose address has escaped
-  DenseSet<GlobalVariable *> VariablesReachableThroughFunctionPointer;
+  // Collect functions whose address has escaped
+  DenseSet<Function*> AddressTakenFuncs;
   for (Function &F : M.functions()) {
     if (!isKernelLDS(&F))
       if (F.hasAddressTaken(nullptr,
@@ -150,11 +150,16 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
                             /* IgnoreAssumeLikeCalls */ false,
                             /* IgnoreLLVMUsed */ true,
                             /* IgnoreArcAttachedCall */ false)) {
-        set_union(VariablesReachableThroughFunctionPointer,
-                  DirectMapFunction[&F]);
+          AddressTakenFuncs.insert(&F);
       }
   }
 
+  // Collect variables that are used by functions whose address has escaped
+  DenseSet<GlobalVariable *> VariablesReachableThroughFunctionPointer;
+  for (Function *F : AddressTakenFuncs) {
+    set_union(VariablesReachableThroughFunctionPointer, DirectMapFunction[F]);
+  }
+
   auto FunctionMakesUnknownCall = [&](const Function *F) -> bool {
     assert(!F->isDeclaration());
     for (const CallGraphNode::CallRecord &R : *CG[F]) {
@@ -206,6 +211,13 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
     }
   }
 
+  // Collect variables that are transitively used by functions whose address has
+  // escaped
+  for (Function *F : AddressTakenFuncs) {
+    set_union(VariablesReachableThroughFunctionPointer,
+              TransitiveMapFunction[F]);
+  }
+
   // DirectMapKernel lists which variables are used by the kernel
   // find the variables which are used through a function call
   FunctionVariableMap IndirectMapKernel;
@@ -218,11 +230,37 @@ LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M) {
       Function *Ith = R.second->getFunction();
       if (Ith) {
         set_union(IndirectMapKernel[&Func], TransitiveMapFunction[Ith]);
-      } else {
-        set_union(IndirectMapKernel[&Func],
-                  VariablesReachableThroughFunctionPointer);
       }
     }
+
+    // Check if the kernel encounters unknows calls, wheher directly or
+    // indirectly.
+    bool SeesUnknownCalls = [&]() {
+      SmallVector<Function *> WorkList = {CG[&Func]->getFunction()};
+      SmallPtrSet<Function *, 8> Visited;
+
+      while (!WorkList.empty()) {
+        Function *F = WorkList.pop_back_val();
+
+        for (const CallGraphNode::CallRecord &CallRecord : *CG[F]) {
+          if (!CallRecord.second)
+            continue;
+
+          Function *Callee = CallRecord.second->getFunction();
+          if (!Callee)
+            return true;
+
+          if (Visited.insert(Callee).second)
+            WorkList.push_back(Callee);
+        }
+      }
+      return false;
+    }();
+
+    if (SeesUnknownCalls) {
+      set_union(IndirectMapKernel[&Func],
+                VariablesReachableThroughFunctionPointer);
+    }
   }
 
   // Verify that we fall into one of 2 cases:
diff --git a/llvm/test/CodeGen/AMDGPU/lower-indirect-lds-references.ll b/llvm/test/CodeGen/AMDGPU/lower-indirect-lds-references.ll
new file mode 100644
index 00000000000000..f14f8d071af564
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/lower-indirect-lds-references.ll
@@ -0,0 +1,44 @@
+; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s
+
+; Tests that the LDS lowering pass handles indirect references to LDS GVs; i.e.
+; that it lowers to accesses into the generated LDS struct if these references
+; are deep in the call graph starting at the kernel.
+
+@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8
+
+%store_type = type { i32, ptr }
+@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8
+
+define amdgpu_kernel void @offloading_kernel() {
+  store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
+  call void @call_unknown()
+  ret void
+}
+
+define void @call_unknown() {
+  %1 = alloca ptr, align 8
+  %2 = call i32 %1()
+  ret void
+}
+
+define void @indirectly_load_lds() {
+  call void @directly_load_lds()
+  ret void
+}
+
+define void @directly_load_lds() {
+  %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
+  ret void
+}
+
+; CHECK: %[[LDS_STRUCT_TY:.*]] = type { %store_type, ptr }
+; CHECK: @[[LDS_STRUCT:.*]] = {{.*}} %[[LDS_STRUCT_TY]] {{.*}} !absolute_symbol
+
+; CHECK: define amdgpu_kernel void @offloading_kernel() {{.*}} {
+; CHECK:   store ptr @indirectly_load_lds, {{.*}} @[[LDS_STRUCT]]
+; CHECK:   call void @call_unknown()
+; CHECK: }
+
+; CHECK: define void @directly_load_lds() {
+; CHECK:   load ptr, {{.*}} (%[[LDS_STRUCT_TY]], {{.*}} @[[LDS_STRUCT]], i32 0, i32 1)
+; CHECK: }
diff --git a/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll b/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll
index 2850612d700817..1765bd1cfb0086 100644
--- a/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll
+++ b/llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll
@@ -196,7 +196,7 @@ define amdgpu_kernel void @kernel_lds_recursion() {
 ; CHECK: attributes #[[ATTR2]] = { "amdgpu-lds-size"="2" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR3]] = { "amdgpu-lds-size"="4" "amdgpu-waves-per-eu"="4,10" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR4]] = { "amdgpu-lds-size"="2" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR5]] = { "amdgpu-lds-size"="2" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="4,10" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR5]] = { "amdgpu-lds-size"="4" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="4,10" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR6:[0-9]+]] = { nocallback nofree nosync nounwind willreturn memory(none) }
 ; CHECK: attributes #[[ATTR7:[0-9]+]] = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
 ;.

github-actions · 2025-01-23T09:15:06Z

✅ With the latest revision this PR passed the C/C++ code formatter.

github-actions · 2025-01-23T09:15:06Z

✅ With the latest revision this PR passed the undef deprecator.

Fixes llvm#123800 Extends LDS lowering by allowing it to discover transitive indirect/escpaing references to LDS GVs. For example, given the following input: ```llvm @lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8 %store_type = type { i32, ptr } @place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8 define amdgpu_kernel void @offloading_kernel() { store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8 call void @call_unknown() ret void } define void @call_unknown() { %1 = alloca ptr, align 8 %2 = call i32 %1() ret void } define void @indirectly_load_lds() { call void @directly_load_lds() ret void } define void @directly_load_lds() { %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8 ret void } ``` With the above input, prior to this patch, LDS lowering failed to lower the reference to `@lds_item_to_indirectly_load` because: 1. it is indirectly called by a function whose address is taken in the kernel. 2. we did not check if the kernel indirectly makes any calls to unknown functions (we only checked the direct calls). Co-authored-by: Jon Chesterfield <[email protected]>

JonChesterfield

Algorithm is as we discussed, implementation is completely clear, covers the problem described in the issue. Happy to accept!

I'm nervous about the extra callgraph walk but I think the cunning hacks to avoid it are now DoA given the transitive indirect calls. Something is going to have to crawl the whole graph after all. I should be able to follow this with a NFC that removes some of the redundant pointer chasing.

mjklemm

LGTM

shiltian · 2025-02-17T00:55:42Z

This PR caused a pretty large performance regression for rocRAND (SWDEV-513849). Please consider reverting it if it can't be fixed in a reasonable period of time.

ergawy · 2025-02-17T09:39:05Z

I can take a look tomorrow unless @JonChesterfield wants to do that himself (since he has substantially more experience with this part of the code than me).

JonChesterfield · 2025-02-17T15:37:12Z

This doesn't sound right.

This patch fixes miscompilation of programs with complicated (indirect function themed) call graphs, which were not really expected to exist outside of O0. rocRAND is supposed to be performance sensitive so really shouldn't be full of indirect calls.

It might mean we've regressed codegen on things that were previously compiled correctly in which case we've made a mistake here, it should have been a no-op. Hopefully the internal bug report has some IR associated, it would be useful to see which case we have here.

No to reverting though, this fixed a straightforward miscompilation that was blocking flang O0. We either accept that rocRAND should stop using indirect calls or work out where we've accidentally regressed.

JonChesterfield · 2025-02-17T18:06:30Z

Downloaded the docker containing the compiler that claims a performance regression
Downloaded rocRAND and checked out the specific commit
Set save-temps in environment variable, built the benchmark
Copied benchmark_rocrand_kernel-hip-amdgcn-amd-amdhsa-gfx942.bc out
Ran that bitcode through amd-staging llc with -print-before=amdgpu-lower-module-lds, extracted the IR
Passed that IR through the amd-staging pass with and without this patch, no difference
Ran the same bitcode through the staging llc, extracted different IR
Passed that IR through the amd-staging pass with and without this patch, no difference

I claim that the regression is elsewhere.

shiltian · 2025-02-17T18:13:43Z

I must have made a mistake. Sorry for the noise.

JonChesterfield · 2025-02-17T18:20:50Z

Not at all, thank you for teaching me how our docker infra works! Happy hunting

llvmbot added the backend:AMDGPU label Jan 23, 2025

ergawy force-pushed the handle_indirect_lds_references branch from ae2e16b to 3087adb Compare January 23, 2025 09:16

ergawy force-pushed the handle_indirect_lds_references branch from 3087adb to 70b1fed Compare January 23, 2025 09:22

ergawy requested review from shiltian, skc7 and JonChesterfield January 23, 2025 10:25

JonChesterfield approved these changes Jan 23, 2025

View reviewed changes

ronlieb self-requested a review January 23, 2025 11:52

mjklemm approved these changes Jan 23, 2025

View reviewed changes

ergawy merged commit ff55c9b into llvm:main Jan 23, 2025
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering #124089

[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering #124089

Uh oh!

ergawy commented Jan 23, 2025

Uh oh!

llvmbot commented Jan 23, 2025

Uh oh!

github-actions bot commented Jan 23, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Jan 23, 2025 •

edited

Loading

Uh oh!

JonChesterfield left a comment

Uh oh!

mjklemm left a comment

Uh oh!

Uh oh!

shiltian commented Feb 17, 2025

Uh oh!

ergawy commented Feb 17, 2025

Uh oh!

JonChesterfield commented Feb 17, 2025

Uh oh!

JonChesterfield commented Feb 17, 2025 •

edited

Loading

Uh oh!

shiltian commented Feb 17, 2025

Uh oh!

JonChesterfield commented Feb 17, 2025

Uh oh!

Uh oh!

[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering #124089

[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering #124089

Uh oh!

Conversation

ergawy commented Jan 23, 2025

Uh oh!

llvmbot commented Jan 23, 2025

Uh oh!

github-actions bot commented Jan 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jan 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JonChesterfield left a comment

Choose a reason for hiding this comment

Uh oh!

mjklemm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

shiltian commented Feb 17, 2025

Uh oh!

ergawy commented Feb 17, 2025

Uh oh!

JonChesterfield commented Feb 17, 2025

Uh oh!

JonChesterfield commented Feb 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shiltian commented Feb 17, 2025

Uh oh!

JonChesterfield commented Feb 17, 2025

Uh oh!

Uh oh!

github-actions bot commented Jan 23, 2025 •

edited

Loading

github-actions bot commented Jan 23, 2025 •

edited

Loading

JonChesterfield commented Feb 17, 2025 •

edited

Loading