[MachineLICM] Work-around Incomplete RegUnits #95926


Merged: 1 commit, Jun 20, 2024

Conversation

Pierre-vh
Contributor

@Pierre-vh Pierre-vh commented Jun 18, 2024

Reverts the behavior introduced by 770393b while keeping the refactored code.

Fixes a miscompile on AArch64, at the cost of a small regression on AMDGPU.
#96146 opened to investigate the issue

@llvmbot
Member

llvmbot commented Jun 18, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Pierre van Houtryve (Pierre-vh)

Changes

Reverts the behavior introduced by 770393b while keeping the refactored code.

Fixes a test failure on AArch64, at the cost of a small regression for AMDGPU which I will investigate. In the meantime, correctness prevails.


Full diff: https://github.com/llvm/llvm-project/pull/95926.diff

3 Files Affected:

  • (modified) llvm/lib/CodeGen/MachineLICM.cpp (+6-8)
  • (added) llvm/test/CodeGen/AArch64/mlicm-csr-mask.mir (+49)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-call.ll (+2-2)
diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index d81fe54fe844c..e76e7d9cd3eed 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -426,28 +426,26 @@ static bool InstructionStoresToFI(const MachineInstr *MI, int FI) {
 static void applyBitsNotInRegMaskToRegUnitsMask(const TargetRegisterInfo &TRI,
                                                 BitVector &RUs,
                                                 const uint32_t *Mask) {
-  BitVector ClobberedRUs(TRI.getNumRegUnits(), true);
+  // FIXME: This is overly conservative when applying regmasks from, e.g. calls.
+  // See `test/CodeGen/AMDGPU/indirect-call.ll` regression.
+  BitVector RUsFromRegsNotInMask(TRI.getNumRegUnits());
   const unsigned NumRegs = TRI.getNumRegs();
   const unsigned MaskWords = (NumRegs + 31) / 32;
   for (unsigned K = 0; K < MaskWords; ++K) {
     const uint32_t Word = Mask[K];
-    if (!Word)
-      continue;
-
     for (unsigned Bit = 0; Bit < 32; ++Bit) {
       const unsigned PhysReg = (K * 32) + Bit;
       if (PhysReg == NumRegs)
         break;
 
-      // Check if we have a valid PhysReg that is set in the mask.
-      if ((Word >> Bit) & 1) {
+      if (PhysReg && !((Word >> Bit) & 1)) {
         for (MCRegUnitIterator RUI(PhysReg, &TRI); RUI.isValid(); ++RUI)
-          ClobberedRUs.reset(*RUI);
+          RUsFromRegsNotInMask.set(*RUI);
       }
     }
   }
 
-  RUs |= ClobberedRUs;
+  RUs |= RUsFromRegsNotInMask;
 }
 
 /// Examine the instruction for potentai LICM candidate. Also
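The decoding in this hunk follows LLVM's regmask convention: the mask is an array of 32-bit words in which a set bit means the corresponding physical register is preserved across the instruction, and PhysReg 0 is the invalid-register sentinel. A standalone sketch of the same walk (the helper name and plain-vector result are illustrative, not part of the patch):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: decode a regmask into the list of clobbered physical registers.
// Convention (as in the hunk above): bit (Reg % 32) of word (Reg / 32) is
// SET when Reg is preserved; PhysReg 0 is the "no register" sentinel.
std::vector<unsigned> clobberedRegs(const uint32_t *Mask, unsigned NumRegs) {
  std::vector<unsigned> Clobbered;
  const unsigned MaskWords = (NumRegs + 31) / 32;
  for (unsigned K = 0; K < MaskWords; ++K) {
    const uint32_t Word = Mask[K];
    for (unsigned Bit = 0; Bit < 32; ++Bit) {
      const unsigned PhysReg = K * 32 + Bit;
      if (PhysReg == NumRegs)
        break;
      // A clear bit means the register is not preserved by this mask.
      if (PhysReg && !((Word >> Bit) & 1))
        Clobbered.push_back(PhysReg);
    }
  }
  return Clobbered;
}
```

This also shows why the patch drops the early `if (!Word) continue;`: once the bit test is inverted, an all-zero word means every register it covers is clobbered, so the word can no longer be skipped.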
diff --git a/llvm/test/CodeGen/AArch64/mlicm-csr-mask.mir b/llvm/test/CodeGen/AArch64/mlicm-csr-mask.mir
new file mode 100644
index 0000000000000..f6a0abfdc410b
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/mlicm-csr-mask.mir
@@ -0,0 +1,49 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=aarch64-unknown-linux-gnu -run-pass=greedy,machinelicm -verify-machineinstrs -debug -o - %s | FileCheck %s
+
+# FIXME: Running RA is needed otherwise it runs pre-RA LICM.
+---
+name: test
+tracksRegLiveness: true
+body: |
+  ; CHECK-LABEL: name: test
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT:   liveins: $x0, $w1, $x2
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   B %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
+  ; CHECK-NEXT:   liveins: $x0, $w1, $x2
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   renamable $q11 = MOVIv4i32 2, 8
+  ; CHECK-NEXT:   BL &memset, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit $x0, implicit $w1, implicit $x2, implicit-def $sp, implicit-def $x0
+  ; CHECK-NEXT:   renamable $q10 = MVNIv4i32 4, 0
+  ; CHECK-NEXT:   $xzr = SUBSXri $x0, 1, 0, implicit-def $nzcv
+  ; CHECK-NEXT:   Bcc 11, %bb.1, implicit $nzcv
+  ; CHECK-NEXT:   B %bb.2
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2:
+  ; CHECK-NEXT:   liveins: $q10, $q11
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   $q0 = COPY $q10
+  ; CHECK-NEXT:   $q1 = COPY $q11
+  bb.0:
+    liveins: $x0, $w1, $x2
+    B %bb.1
+
+  bb.1:
+    liveins: $x0, $w1, $x2
+    renamable $q11 = MOVIv4i32 2, 8
+    BL &memset, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit $x0, implicit $w1, implicit $x2, implicit-def $sp, implicit-def $x0
+    renamable $q10 = MVNIv4i32 4, 0
+    $xzr = SUBSXri $x0, 1, 0, implicit-def $nzcv
+    Bcc 11, %bb.1, implicit $nzcv
+    B %bb.2
+
+  bb.2:
+    liveins: $q10, $q11
+    $q0 = COPY $q10
+    $q1 = COPY $q11
+...
diff --git a/llvm/test/CodeGen/AMDGPU/indirect-call.ll b/llvm/test/CodeGen/AMDGPU/indirect-call.ll
index 7799b9509ceb0..da8aa54469835 100644
--- a/llvm/test/CodeGen/AMDGPU/indirect-call.ll
+++ b/llvm/test/CodeGen/AMDGPU/indirect-call.ll
@@ -886,12 +886,12 @@ define void @test_indirect_call_vgpr_ptr_inreg_arg(ptr %fptr) {
 ; GCN-NEXT:    v_writelane_b32 v40, s62, 30
 ; GCN-NEXT:    v_writelane_b32 v40, s63, 31
 ; GCN-NEXT:    s_mov_b64 s[6:7], exec
-; GCN-NEXT:    s_movk_i32 s4, 0x7b
 ; GCN-NEXT:  .LBB6_1: ; =>This Inner Loop Header: Depth=1
 ; GCN-NEXT:    v_readfirstlane_b32 s8, v0
 ; GCN-NEXT:    v_readfirstlane_b32 s9, v1
 ; GCN-NEXT:    v_cmp_eq_u64_e32 vcc, s[8:9], v[0:1]
 ; GCN-NEXT:    s_and_saveexec_b64 s[10:11], vcc
+; GCN-NEXT:    s_movk_i32 s4, 0x7b
 ; GCN-NEXT:    s_swappc_b64 s[30:31], s[8:9]
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; GCN-NEXT:    s_xor_b64 exec, exec, s[10:11]
@@ -980,12 +980,12 @@ define void @test_indirect_call_vgpr_ptr_inreg_arg(ptr %fptr) {
 ; GISEL-NEXT:    v_writelane_b32 v40, s62, 30
 ; GISEL-NEXT:    v_writelane_b32 v40, s63, 31
 ; GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GISEL-NEXT:    s_movk_i32 s4, 0x7b
 ; GISEL-NEXT:  .LBB6_1: ; =>This Inner Loop Header: Depth=1
 ; GISEL-NEXT:    v_readfirstlane_b32 s8, v0
 ; GISEL-NEXT:    v_readfirstlane_b32 s9, v1
 ; GISEL-NEXT:    v_cmp_eq_u64_e32 vcc, s[8:9], v[0:1]
 ; GISEL-NEXT:    s_and_saveexec_b64 s[10:11], vcc
+; GISEL-NEXT:    s_movk_i32 s4, 0x7b
 ; GISEL-NEXT:    s_swappc_b64 s[30:31], s[8:9]
 ; GISEL-NEXT:    ; implicit-def: $vgpr0
 ; GISEL-NEXT:    s_xor_b64 exec, exec, s[10:11]

@llvmbot
Member

llvmbot commented Jun 18, 2024

@llvm/pr-subscribers-backend-amdgpu

(Same author, patch description, and full diff as the @llvm/pr-subscribers-backend-aarch64 comment above.)
@@ -0,0 +1,49 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
# RUN: llc -mtriple=aarch64-unknown-linux-gnu -run-pass=greedy,machinelicm -verify-machineinstrs -debug -o - %s | FileCheck %s
Contributor:
I don't see why this is using -debug, but note that -debug requires an asserts build.

# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
# RUN: llc -mtriple=aarch64-unknown-linux-gnu -run-pass=greedy,machinelicm -verify-machineinstrs -debug -o - %s | FileCheck %s

# FIXME: Running RA is needed otherwise it runs pre-RA LICM.
Contributor:
But machinelicm should be the post-RA one? early-machinelicm is the pre-RA pass.

Contributor Author:
For some reason it only triggers if I run RA first. Otherwise MRI->isSSA() seems to return true, and that's what the pass uses to determine whether it's pre- or post-RA.

Contributor:
I think you are working around the MIR parser bug where it assumes SSA until it finds something obviously not SSA. The pass needs to clear the properties and I think we need an explicit SSA property

; CHECK-NEXT: {{ $}}
; CHECK-NEXT: renamable $q11 = MOVIv4i32 2, 8
; CHECK-NEXT: BL &memset, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit $x0, implicit $w1, implicit $x2, implicit-def $sp, implicit-def $x0
; CHECK-NEXT: renamable $q10 = MVNIv4i32 4, 0
Contributor:
What does this test actually demonstrate? Without this patch, I see that the def of q10 gets hoisted past this memset call:

BL &memset, <regmask $fp $lr $wzr $xzr $b8 $b9 $b10 $b11 $b12 $b13 $b14 $b15 $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $h8 $h9 $h10 $h11 
$h12 $h13 $h14 $h15 $s8 $s9 $s10 $s11 $s12 $s13 $s14 $s15 $w19 $w20 $w21 $w22 $w23 $w24 $w25 $w26 $w27 $w28 $w29 $w30 $x19 $x20 $x21 $x22 $x23 
$x24 $x25 $x26 $x27 $x28 $d8_d9 $d9_d10 $d10_d11 $d11_d12 $d12_d13 $d13_d14 $d14_d15 $d8_d9_d10_d11 $d9_d10_d11_d12 $d10_d11_d12_d13 $d11_d12_d
13_d14 $d12_d13_d14_d15 $d8_d9_d10 $d9_d10_d11 $d10_d11_d12 $d11_d12_d13 $d12_d13_d14 $d13_d14_d15 $x22_x23_x24_x25_x26_x27_x28_fp $x20_x21_x22
_x23_x24_x25_x26_x27 $w20_w21 $w22_w23 $w24_w25 $w26_w27 $w28_w29 $x28_fp $x20_x21 $x22_x23 $x24_x25 $x26_x27>, implicit-def dead $lr, implicit
 $sp, implicit $x0, implicit $w1, implicit $x2, implicit-def $sp, implicit-def $x0

I don't understand why this happened. q10 has subregs d20 and d21 but I don't see any of those listed in this regmask.

Collaborator:
The D20 D21 being subregs of Q10 is an ARM thing that wasn't carried over into AArch64.

Contributor:
D'oh - I thought this test case was ARM! Let me take another look...

Collaborator:
I was trying to get my head around what was going on. There is a rule in the AArch64 PCS which talks about the top bits of a Q register needing to be preserved across a call:

Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value stored in v8-v15 need to be preserved; it is the responsibility of the caller to preserve larger values.

Contributor:
OK so AArch64 registers work like this: b10 aliases the low 8 bits of h10 which aliases the low 16 bits of s10 which aliases the low 32 bits of d10 which aliases the low 64 bits of q10 which aliases the low 128 bits of z10.

My suspicion is that TableGen only creates one regunit for all of these *10 registers. If that's true then there's no way for regunit-based clobbering info to express that d10 is preserved but q10 is not (because its high bits are clobbered). So I think we would have to go back and revert #94608.
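A toy model of the suspicion above (the register names and the single unit number 35 come from this thread; the helper itself is hypothetical, not LLVM API): if every view of register 10 shares one regunit, then marking q10's units clobbered necessarily marks d10's unit too, even though the mask preserves d10.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy regunit model: map each register view to its set of regunits, then
// compute which units a regmask clobbers. With a single shared unit there
// is no way to express "d10 preserved, but q10's high bits clobbered".
std::set<int> clobberedUnits(
    const std::map<std::string, std::set<int>> &RegUnits,
    const std::set<std::string> &PreservedRegs) {
  std::set<int> Clobbered;
  for (const auto &[Reg, Units] : RegUnits)
    if (!PreservedRegs.count(Reg)) // Reg not in mask: its units are dirty.
      Clobbered.insert(Units.begin(), Units.end());
  return Clobbered;
}
```

With all of b10/h10/s10/d10/q10 mapped to unit {35} and b10/h10/s10/d10 preserved, unit 35 still comes back clobbered, so d10 reads as clobbered despite being in the mask; that loss of precision is what forces the conservative per-register treatment in this patch.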

Contributor:
Add a flag to registers to tell TableGen that subregs cover the whole register (EDIT: we already have one)

What is that flag that we already have?

Anyway I think this kind of approach sounds reasonable, but should probably get some wider discussion. Possible alternative ideas:

  • Declare the width of every physical register, so that tablegen can work out by itself whether or not it is completely covered by subregs.
  • Change the AArch64 register definitions to declare some kind of dummy subreg for the high parts, which is only used to force tablegen to generate the extra regunits. The only advantage of this is that we might not have to modify tablegen at all.

Contributor Author:
Opened #96146 to continue the discussion

Contributor:
Opened #96146 to continue the discussion

Thanks. In the meantime I guess it's OK to commit the current patch to fix the AArch64 breakage, with a comment explaining that it's overly conservative (hence the AMDGPU test regression).

Collaborator:
Hello. Sorry, all the subtleties of RegUnits are not something I know a lot about (and I may be a bit unresponsive today, sorry if I am). Perhaps @efriedma-quic or @TNorthover know more about how it should all work. We would usually revert back to a known-good state or fix quickly, and leaving it broken for any length of time isn't the llvm way of doing things. So I'm all in favour of this fix, thanks.

We have scalable registers in AArch64 aliasing the same registers too, to throw another potential spanner in the works. I don't believe those registers actually get marked as scalable at the moment though. If we go with option 3 then we should try and make sure that works OK too.

Collaborator:
Sorry all, I'm just seeing this thread now.
Apologies if I repeat things that have already been said, I haven't read through the whole history.

I agree that would be nice, but that is not what TableGen currently does. I've just checked, and all of b10/h10/s10/d10/q10/z10 have a single regunit, number 35.

@jayfoad is right, this is how the regunits work, and the issue with register masks is well known (well, at least @MatzeB and I know :P). We shouldn't use regunits for regmask operands, because they are indeed not precise enough, until we go with option 2.

(And by the way changing regunits shouldn't change regalloc. I'm curious to see why you @Pierre-vh saw this.)

Fixes a miscompile on AArch64, at the cost of a small regression on AMDGPU.

llvm#96146 opened to investigate the issue.
@Pierre-vh Pierre-vh force-pushed the fix-machine-licm-again branch from 4792ee5 to f8da46e Compare June 20, 2024 08:47
@Pierre-vh Pierre-vh changed the title [MachineLICM] Workaround - apply RegMasks conservatively [MachineLICM] Work-around Incomplete RegUnits Jun 20, 2024
@Pierre-vh Pierre-vh requested review from arsenm and jayfoad June 20, 2024 08:48
@Pierre-vh
Contributor Author

Pierre-vh commented Jun 20, 2024

I updated the patch to include a thorough description of the problem. I would like to move forward with this fix so we don't have a miscompile in trunk while we investigate the issue, without having to revert a patch that solves a major performance issue in MachineLICM in AMDGPU.

@Pierre-vh Pierre-vh merged commit f0897ea into llvm:main Jun 20, 2024
4 of 6 checks passed
@Pierre-vh Pierre-vh deleted the fix-machine-licm-again branch June 20, 2024 08:59
@mstorsjo
Member

Thanks; with this fix merged now, my original case (as explained in #95746 (comment)) works correctly again.

AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this pull request Jul 9, 2024
Reverts the behavior introduced by 770393b while keeping the refactored
code.

Fixes a miscompile on AArch64, at the cost of a small regression on
AMDGPU.
llvm#96146 opened to investigate the issue
7 participants