[MachineLICM] Work-around Incomplete RegUnits #95926
Conversation
@llvm/pr-subscribers-backend-aarch64

Author: Pierre van Houtryve (Pierre-vh)

Changes

Reverts the behavior introduced by 770393b while keeping the refactored code. Fixes a test failure on AArch64, at the cost of a small regression for AMDGPU which I will investigate. In the meantime, correctness prevails.

Full diff: https://github.com/llvm/llvm-project/pull/95926.diff

3 Files Affected:
diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index d81fe54fe844c..e76e7d9cd3eed 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -426,28 +426,26 @@ static bool InstructionStoresToFI(const MachineInstr *MI, int FI) {
static void applyBitsNotInRegMaskToRegUnitsMask(const TargetRegisterInfo &TRI,
BitVector &RUs,
const uint32_t *Mask) {
- BitVector ClobberedRUs(TRI.getNumRegUnits(), true);
+ // FIXME: This is overly conservative when applying regmasks from, e.g. calls.
+ // See `test/CodeGen/AMDGPU/indirect-call.ll` regression.
+ BitVector RUsFromRegsNotInMask(TRI.getNumRegUnits());
const unsigned NumRegs = TRI.getNumRegs();
const unsigned MaskWords = (NumRegs + 31) / 32;
for (unsigned K = 0; K < MaskWords; ++K) {
const uint32_t Word = Mask[K];
- if (!Word)
- continue;
-
for (unsigned Bit = 0; Bit < 32; ++Bit) {
const unsigned PhysReg = (K * 32) + Bit;
if (PhysReg == NumRegs)
break;
- // Check if we have a valid PhysReg that is set in the mask.
- if ((Word >> Bit) & 1) {
+ if (PhysReg && !((Word >> Bit) & 1)) {
for (MCRegUnitIterator RUI(PhysReg, &TRI); RUI.isValid(); ++RUI)
- ClobberedRUs.reset(*RUI);
+ RUsFromRegsNotInMask.set(*RUI);
}
}
}
- RUs |= ClobberedRUs;
+ RUs |= RUsFromRegsNotInMask;
}
/// Examine the instruction for potentai LICM candidate. Also
diff --git a/llvm/test/CodeGen/AArch64/mlicm-csr-mask.mir b/llvm/test/CodeGen/AArch64/mlicm-csr-mask.mir
new file mode 100644
index 0000000000000..f6a0abfdc410b
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/mlicm-csr-mask.mir
@@ -0,0 +1,49 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=aarch64-unknown-linux-gnu -run-pass=greedy,machinelicm -verify-machineinstrs -debug -o - %s | FileCheck %s
+
+# FIXME: Running RA is needed otherwise it runs pre-RA LICM.
+---
+name: test
+tracksRegLiveness: true
+body: |
+ ; CHECK-LABEL: name: test
+ ; CHECK: bb.0:
+ ; CHECK-NEXT: successors: %bb.1(0x80000000)
+ ; CHECK-NEXT: liveins: $x0, $w1, $x2
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: B %bb.1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.1:
+ ; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.2(0x40000000)
+ ; CHECK-NEXT: liveins: $x0, $w1, $x2
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: renamable $q11 = MOVIv4i32 2, 8
+ ; CHECK-NEXT: BL &memset, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit $x0, implicit $w1, implicit $x2, implicit-def $sp, implicit-def $x0
+ ; CHECK-NEXT: renamable $q10 = MVNIv4i32 4, 0
+ ; CHECK-NEXT: $xzr = SUBSXri $x0, 1, 0, implicit-def $nzcv
+ ; CHECK-NEXT: Bcc 11, %bb.1, implicit $nzcv
+ ; CHECK-NEXT: B %bb.2
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.2:
+ ; CHECK-NEXT: liveins: $q10, $q11
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: $q0 = COPY $q10
+ ; CHECK-NEXT: $q1 = COPY $q11
+ bb.0:
+ liveins: $x0, $w1, $x2
+ B %bb.1
+
+ bb.1:
+ liveins: $x0, $w1, $x2
+ renamable $q11 = MOVIv4i32 2, 8
+ BL &memset, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit $x0, implicit $w1, implicit $x2, implicit-def $sp, implicit-def $x0
+ renamable $q10 = MVNIv4i32 4, 0
+ $xzr = SUBSXri $x0, 1, 0, implicit-def $nzcv
+ Bcc 11, %bb.1, implicit $nzcv
+ B %bb.2
+
+ bb.2:
+ liveins: $q10, $q11
+ $q0 = COPY $q10
+ $q1 = COPY $q11
+...
diff --git a/llvm/test/CodeGen/AMDGPU/indirect-call.ll b/llvm/test/CodeGen/AMDGPU/indirect-call.ll
index 7799b9509ceb0..da8aa54469835 100644
--- a/llvm/test/CodeGen/AMDGPU/indirect-call.ll
+++ b/llvm/test/CodeGen/AMDGPU/indirect-call.ll
@@ -886,12 +886,12 @@ define void @test_indirect_call_vgpr_ptr_inreg_arg(ptr %fptr) {
; GCN-NEXT: v_writelane_b32 v40, s62, 30
; GCN-NEXT: v_writelane_b32 v40, s63, 31
; GCN-NEXT: s_mov_b64 s[6:7], exec
-; GCN-NEXT: s_movk_i32 s4, 0x7b
; GCN-NEXT: .LBB6_1: ; =>This Inner Loop Header: Depth=1
; GCN-NEXT: v_readfirstlane_b32 s8, v0
; GCN-NEXT: v_readfirstlane_b32 s9, v1
; GCN-NEXT: v_cmp_eq_u64_e32 vcc, s[8:9], v[0:1]
; GCN-NEXT: s_and_saveexec_b64 s[10:11], vcc
+; GCN-NEXT: s_movk_i32 s4, 0x7b
; GCN-NEXT: s_swappc_b64 s[30:31], s[8:9]
; GCN-NEXT: ; implicit-def: $vgpr0_vgpr1
; GCN-NEXT: s_xor_b64 exec, exec, s[10:11]
@@ -980,12 +980,12 @@ define void @test_indirect_call_vgpr_ptr_inreg_arg(ptr %fptr) {
; GISEL-NEXT: v_writelane_b32 v40, s62, 30
; GISEL-NEXT: v_writelane_b32 v40, s63, 31
; GISEL-NEXT: s_mov_b64 s[6:7], exec
-; GISEL-NEXT: s_movk_i32 s4, 0x7b
; GISEL-NEXT: .LBB6_1: ; =>This Inner Loop Header: Depth=1
; GISEL-NEXT: v_readfirstlane_b32 s8, v0
; GISEL-NEXT: v_readfirstlane_b32 s9, v1
; GISEL-NEXT: v_cmp_eq_u64_e32 vcc, s[8:9], v[0:1]
; GISEL-NEXT: s_and_saveexec_b64 s[10:11], vcc
+; GISEL-NEXT: s_movk_i32 s4, 0x7b
; GISEL-NEXT: s_swappc_b64 s[30:31], s[8:9]
; GISEL-NEXT: ; implicit-def: $vgpr0
; GISEL-NEXT: s_xor_b64 exec, exec, s[10:11]
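The regmask handling changed by the hunk in MachineLICM.cpp above can be modeled in standalone form. In an LLVM regmask, bit R is set when physical register R is preserved by the instruction, so the patched loop collects the units of every register whose bit is clear (skipping register 0, the invalid register). The sketch below is a plain-C++ model, not LLVM code: a flat list of clobbered register numbers stands in for the regunit BitVector, and the function name is illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Standalone model of the loop in applyBitsNotInRegMaskToRegUnitsMask.
// In an LLVM regmask, bit R is SET when physical register R is
// preserved; a CLEAR bit means R may be clobbered. Register 0 is the
// invalid register and is skipped.
std::vector<unsigned> regsNotInMask(const std::vector<uint32_t> &Mask,
                                    unsigned NumRegs) {
  std::vector<unsigned> Clobbered;
  const unsigned MaskWords = (NumRegs + 31) / 32;
  assert(Mask.size() >= MaskWords && "regmask too short for NumRegs");
  for (unsigned K = 0; K < MaskWords; ++K) {
    const uint32_t Word = Mask[K];
    // Note: no `if (!Word) continue;` shortcut here. With the logic
    // inverted (collecting clobbered registers instead of resetting
    // preserved ones), an all-zero word still contributes: every
    // register covered by it is clobbered.
    for (unsigned Bit = 0; Bit < 32; ++Bit) {
      const unsigned PhysReg = (K * 32) + Bit;
      if (PhysReg == NumRegs)
        break; // past the last valid register
      if (PhysReg && !((Word >> Bit) & 1))
        Clobbered.push_back(PhysReg); // the real code sets its regunits
    }
  }
  return Clobbered;
}
```

For example, with NumRegs = 8 and a single mask word 0b110 (registers 1 and 2 preserved), the model reports registers 3 through 7 as clobbered.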
@llvm/pr-subscribers-backend-amdgpu
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
# RUN: llc -mtriple=aarch64-unknown-linux-gnu -run-pass=greedy,machinelicm -verify-machineinstrs -debug -o - %s | FileCheck %s
I don't see why this is using -debug, but that requires asserts.
# FIXME: Running RA is needed otherwise it runs pre-RA LICM.
but machinelicm should be the post-RA one? early-machinelicm is pre-RA
For some reason it only triggers if I run RA first; otherwise it seems MRI->isSSA() returns true, and that's what it uses to determine whether it's pre or post RA.
I think you are working around the MIR parser bug where it assumes SSA until it finds something obviously not SSA. The pass needs to clear the properties and I think we need an explicit SSA property
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: renamable $q11 = MOVIv4i32 2, 8
; CHECK-NEXT: BL &memset, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit $x0, implicit $w1, implicit $x2, implicit-def $sp, implicit-def $x0
; CHECK-NEXT: renamable $q10 = MVNIv4i32 4, 0
What does this test actually demonstrate? Without this patch, I see that the def of q10 gets hoisted past this memset call:

BL &memset, <regmask $fp $lr $wzr $xzr $b8 $b9 $b10 $b11 $b12 $b13 $b14 $b15 $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $h8 $h9 $h10 $h11 $h12 $h13 $h14 $h15 $s8 $s9 $s10 $s11 $s12 $s13 $s14 $s15 $w19 $w20 $w21 $w22 $w23 $w24 $w25 $w26 $w27 $w28 $w29 $w30 $x19 $x20 $x21 $x22 $x23 $x24 $x25 $x26 $x27 $x28 $d8_d9 $d9_d10 $d10_d11 $d11_d12 $d12_d13 $d13_d14 $d14_d15 $d8_d9_d10_d11 $d9_d10_d11_d12 $d10_d11_d12_d13 $d11_d12_d13_d14 $d12_d13_d14_d15 $d8_d9_d10 $d9_d10_d11 $d10_d11_d12 $d11_d12_d13 $d12_d13_d14 $d13_d14_d15 $x22_x23_x24_x25_x26_x27_x28_fp $x20_x21_x22_x23_x24_x25_x26_x27 $w20_w21 $w22_w23 $w24_w25 $w26_w27 $w28_w29 $x28_fp $x20_x21 $x22_x23 $x24_x25 $x26_x27>, implicit-def dead $lr, implicit $sp, implicit $x0, implicit $w1, implicit $x2, implicit-def $sp, implicit-def $x0

I don't understand why this happened. q10 has subregs d20 and d21, but I don't see any of those listed in this regmask.
The D20 D21 being subregs of Q10 is an ARM thing that wasn't carried over into AArch64.
D'oh - I thought this test case was ARM! Let me take another look...
I was trying to get my head around what was going on. There is a rule in the AArch64 PCS which talks about the top bits of a Q register needing to be preserved across a call:

Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value stored in v8-v15 need to be preserved; it is the responsibility of the caller to preserve larger values.
OK so AArch64 registers work like this: b10 aliases the low 8 bits of h10, which aliases the low 16 bits of s10, which aliases the low 32 bits of d10, which aliases the low 64 bits of q10, which aliases the low 128 bits of z10.

My suspicion is that TableGen only creates one regunit for all of these *10 registers. If that's true then there's no way for regunit-based clobbering info to express that d10 is preserved but q10 is not (because its high bits are clobbered). So I think we would have to go back and revert #94608.
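This limitation can be demonstrated with a toy model. Assuming, as suspected above (and later confirmed for unit 35), that TableGen emits one shared regunit for the whole b10/h10/s10/d10/q10/z10 alias chain, a regunit-level clobber set cannot express "d10 preserved, q10 clobbered". The sketch below is plain C++; the regunit table is hypothetical, not generated from LLVM's descriptions.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical regunit table: one shared unit (35) for the entire
// b10/h10/s10/d10/q10/z10 alias chain.
static const std::map<std::string, std::set<int>> RegUnits = {
    {"b10", {35}}, {"h10", {35}}, {"s10", {35}},
    {"d10", {35}}, {"q10", {35}}, {"z10", {35}},
};

// A register counts as clobbered if any of its regunits is clobbered.
bool isClobbered(const std::string &Reg, const std::set<int> &ClobberedUnits) {
  for (int U : RegUnits.at(Reg))
    if (ClobberedUnits.count(U))
      return true;
  return false;
}
```

Under the AAPCS rule quoted earlier, a call preserves d10 but clobbers the high half of q10. Marking unit 35 clobbered is overly conservative for d10 (the AMDGPU-style regression); leaving it unmarked wrongly reports q10 as safe (the AArch64 miscompile). Neither encoding is correct, which is why the patch falls back to the conservative choice.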
> Add a flag to registers to tell TableGen that subregs cover the whole register (EDIT: we already have one)

What is that flag that we already have?
Anyway I think this kind of approach sounds reasonable, but should probably get some wider discussion. Possible alternative ideas:
- Declare the width of every physical register, so that tablegen can work out by itself whether or not it is completely covered by subregs.
- Change the AArch64 register definitions to declare some kind of dummy subreg for the high parts, which is only used to force tablegen to generate the extra regunits. The only advantage of this is that we might not have to modify tablegen at all.
Opened #96146 to continue the discussion
> Opened #96146 to continue the discussion

Thanks. In the meantime I guess it's OK to commit the current patch to fix the AArch64 breakage, with a comment explaining that it's overly conservative (hence the AMDGPU test regression).
Hello. Sorry, all the subtleties of RegUnits are not something I know a lot about (and I may be a bit unresponsive today, sorry if I am). Perhaps @efriedma-quic or @TNorthover know more about how it should all work. We would usually revert back to a known-good state or fix quickly, and leaving it broken for any length of time isn't the llvm way of doing things. So I'm all in favour of this fix, thanks.
We have scalable registers in AArch64 aliasing the same registers too, to throw another potential spanner in the works. I don't believe those registers actually get marked as scalable at the moment though. If we go with option 3 then we should try and make sure that works OK too.
Sorry all, I'm just seeing this thread now.
Apologies if I repeat things that have already been said, I haven't read through the whole history.
I agree that would be nice but that is not what TableGen currently does. I've just checked and all of b10 h10 s10 d10 q10 z10 have a single regunit number 35.
@jayfoad is right this is how the regunits work and the issue with register mask is well known (well at least @MatzeB and I know :P). We shouldn't use regunits for regmask operands because they are indeed not precise enough until we go with option 2.
(And by the way changing regunits shouldn't change regalloc. I'm curious to see why you @Pierre-vh saw this.)
Fixes a miscompile on AArch64, at the cost of a small regression on AMDGPU. llvm#96146 opened to investigate the issue.
Force-pushed 4792ee5 to f8da46e.
I updated the patch to include a thorough description of the problem. I would like to move forward with this fix so we don't have a miscompile in trunk while we investigate the issue, without having to revert a patch that solves a major performance issue in MachineLICM in AMDGPU.

Thanks; with this fix merged now, my original case (as explained in #95746 (comment)) works correctly again.
Reverts the behavior introduced by 770393b while keeping the refactored code. Fixes a miscompile on AArch64, at the cost of a small regression on AMDGPU. llvm#96146 opened to investigate the issue