[MachineLICM] Work-around Incomplete RegUnits #95926

Merged (1 commit) on Jun 20, 2024
42 changes: 34 additions & 8 deletions llvm/lib/CodeGen/MachineLICM.cpp
@@ -426,28 +426,54 @@ static bool InstructionStoresToFI(const MachineInstr *MI, int FI) {
 static void applyBitsNotInRegMaskToRegUnitsMask(const TargetRegisterInfo &TRI,
                                                 BitVector &RUs,
                                                 const uint32_t *Mask) {
-  BitVector ClobberedRUs(TRI.getNumRegUnits(), true);
+  // FIXME: This intentionally works in reverse due to some issues with the
+  // Register Units infrastructure.
+  //
+  // This is used to apply callee-saved-register masks to the clobbered regunits
+  // mask.
+  //
+  // The right way to approach this is to start with a BitVector full of ones,
+  // then reset all the bits of the regunits of each register that is set in the
+  // mask (registers preserved), then OR the resulting bits with the Clobbers
+  // mask. This correctly prioritizes the saved registers, so if a RU is shared
+  // between a register that is preserved, and one that is NOT preserved, that
+  // RU will not be set in the output vector (the clobbers).
+  //
+  // What we have to do for now is the opposite: we have to assume that the
+  // regunits of all registers that are NOT preserved are clobbered, even if
+  // those regunits are preserved by another register. So if a RU is shared
+  // like described previously, that RU will be set.
+  //
+  // This is to work around an issue which appears in AArch64, but isn't
+  // exclusive to that target: AArch64's Qn registers (128 bits) have Dn
+  // register (lower 64 bits). A few Dn registers are preserved by some calling
+  // conventions, but Qn and Dn share exactly the same reg units.
+  //
+  // If we do this the right way, Qn will be marked as NOT clobbered even though
+  // its upper 64 bits are NOT preserved. The conservative approach handles this
+  // correctly at the cost of some missed optimizations on other targets.
+  //
+  // This is caused by how RegUnits are handled within TableGen. Ideally, Qn
+  // should have an extra RegUnit to model the "unknown" bits not covered by the
+  // subregs.
+  BitVector RUsFromRegsNotInMask(TRI.getNumRegUnits());
   const unsigned NumRegs = TRI.getNumRegs();
   const unsigned MaskWords = (NumRegs + 31) / 32;
   for (unsigned K = 0; K < MaskWords; ++K) {
     const uint32_t Word = Mask[K];
-    if (!Word)
-      continue;
-
     for (unsigned Bit = 0; Bit < 32; ++Bit) {
       const unsigned PhysReg = (K * 32) + Bit;
       if (PhysReg == NumRegs)
         break;
 
-      // Check if we have a valid PhysReg that is set in the mask.
-      if ((Word >> Bit) & 1) {
+      if (PhysReg && !((Word >> Bit) & 1)) {
         for (MCRegUnitIterator RUI(PhysReg, &TRI); RUI.isValid(); ++RUI)
-          ClobberedRUs.reset(*RUI);
+          RUsFromRegsNotInMask.set(*RUI);
       }
     }
   }
 
-  RUs |= ClobberedRUs;
+  RUs |= RUsFromRegsNotInMask;
 }

 /// Examine the instruction for potential LICM candidate. Also
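For reference, here is a minimal sketch of the "right way" the FIXME above describes: start with every regunit marked clobbered, clear the regunits of each register the mask preserves, then OR the result into the clobbers. It is an illustration rather than the committed code (the function name is invented), and it assumes the usual MachineLICM.cpp context (BitVector, TargetRegisterInfo, MCRegUnitIterator) plus complete regunit information:

static void applyRegMaskTheRightWay(const TargetRegisterInfo &TRI,
                                    BitVector &RUs, const uint32_t *Mask) {
  // Assume every regunit is clobbered until a preserved register claims it.
  BitVector ClobberedRUs(TRI.getNumRegUnits(), true);
  for (unsigned PhysReg = 1, E = TRI.getNumRegs(); PhysReg != E; ++PhysReg) {
    // A set bit in a regmask means the register is preserved across the call.
    if (Mask[PhysReg / 32] & (1u << (PhysReg % 32)))
      for (MCRegUnitIterator RUI(PhysReg, &TRI); RUI.isValid(); ++RUI)
        ClobberedRUs.reset(*RUI);
  }
  // A regunit stays clobbered only if no preserved register owns it.
  RUs |= ClobberedRUs;
}

This is essentially what the pre-patch code did. With today's AArch64 regunits it also clears Qn's units whenever Dn is preserved, so Qn looks unclobbered even though its top half is not saved, which is exactly the problem the patch works around by inverting the logic.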
49 changes: 49 additions & 0 deletions llvm/test/CodeGen/AArch64/mlicm-csr-mask.mir
@@ -0,0 +1,49 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
# RUN: llc -mtriple=aarch64-unknown-linux-gnu -run-pass=greedy,machinelicm -verify-machineinstrs -debug -o - %s | FileCheck %s
Contributor:
Don't see why this is using -debug, but that requires asserts


# FIXME: Running RA is needed otherwise it runs pre-RA LICM.
Contributor:
but machinelicm should be the post-RA one? early-machinelicm is pre-RA

Contributor Author:
For some reason it only triggers if I run RA first, otherwise it seems MRI->isSSA() returns true and that's what it uses to determine whether it's pre or post RA
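In code terms, that check looks roughly like this (my paraphrase of the behaviour described above, not a verbatim quote of MachineLICM.cpp):

// MachineLICM picks pre- vs post-RA behaviour from the register info; if the
// parsed MIR still looks like SSA, it takes the pre-RA path, so the test only
// exercises post-RA LICM once a register-allocation pass has run first.
bool PreRegAlloc = MRI->isSSA();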

Contributor:
I think you are working around the MIR parser bug where it assumes SSA until it finds something obviously not SSA. The pass needs to clear the properties and I think we need an explicit SSA property

---
name: test
tracksRegLiveness: true
body: |
; CHECK-LABEL: name: test
; CHECK: bb.0:
; CHECK-NEXT: successors: %bb.1(0x80000000)
; CHECK-NEXT: liveins: $x0, $w1, $x2
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: B %bb.1
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: bb.1:
; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.2(0x40000000)
; CHECK-NEXT: liveins: $x0, $w1, $x2
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: renamable $q11 = MOVIv4i32 2, 8
; CHECK-NEXT: BL &memset, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit $x0, implicit $w1, implicit $x2, implicit-def $sp, implicit-def $x0
; CHECK-NEXT: renamable $q10 = MVNIv4i32 4, 0
Contributor:
What does this test actually demonstrate? Without this patch, I see that the def of q10 gets hoisted past this memset call:

BL &memset, <regmask $fp $lr $wzr $xzr $b8 $b9 $b10 $b11 $b12 $b13 $b14 $b15 $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $h8 $h9 $h10 $h11 $h12 $h13 $h14 $h15 $s8 $s9 $s10 $s11 $s12 $s13 $s14 $s15 $w19 $w20 $w21 $w22 $w23 $w24 $w25 $w26 $w27 $w28 $w29 $w30 $x19 $x20 $x21 $x22 $x23 $x24 $x25 $x26 $x27 $x28 $d8_d9 $d9_d10 $d10_d11 $d11_d12 $d12_d13 $d13_d14 $d14_d15 $d8_d9_d10_d11 $d9_d10_d11_d12 $d10_d11_d12_d13 $d11_d12_d13_d14 $d12_d13_d14_d15 $d8_d9_d10 $d9_d10_d11 $d10_d11_d12 $d11_d12_d13 $d12_d13_d14 $d13_d14_d15 $x22_x23_x24_x25_x26_x27_x28_fp $x20_x21_x22_x23_x24_x25_x26_x27 $w20_w21 $w22_w23 $w24_w25 $w26_w27 $w28_w29 $x28_fp $x20_x21 $x22_x23 $x24_x25 $x26_x27>, implicit-def dead $lr, implicit $sp, implicit $x0, implicit $w1, implicit $x2, implicit-def $sp, implicit-def $x0

I don't understand why this happened. q10 has subregs d20 and d21 but I don't see any of those listed in this regmask.

Collaborator:
The D20 D21 being subregs of Q10 is an ARM thing that wasn't carried over into AArch64.

Contributor:
D'oh - I thought this test case was ARM! Let me take another look...

Collaborator:
I was trying to get my head around what was going on. There is a rule in the AArch64 PCS which talks about the top bits of a Q register needing to be preserved across a call:

Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value stored in v8-v15 need to be preserved; it is the responsibility of the caller to preserve larger values.
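That rule is visible directly in the regmask at the register level. A quick sketch, added here for illustration rather than taken from the patch, assuming an AArch64-for-Linux MachineFunction MF, its TargetRegisterInfo TRI, and the usual LLVM CodeGen and AArch64 headers in scope:

// The C calling convention on a Linux AArch64 target yields the
// csr_aarch64_aapcs regmask shown above.
const uint32_t *Mask = TRI.getCallPreservedMask(MF, CallingConv::C);
// D10 (the low 64 bits of Q10) is callee-saved; Q10 as a whole is not.
dbgs() << MachineOperand::clobbersPhysReg(Mask, AArch64::D10) << '\n'; // 0: preserved
dbgs() << MachineOperand::clobbersPhysReg(Mask, AArch64::Q10) << '\n'; // 1: clobbered

Register-level queries like clobbersPhysReg give the precise answer; the trouble discussed below is that the regunit-level view cannot express it.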

Contributor:
OK so AArch64 registers work like this: b10 aliases the low 8 bits of h10 which aliases the low 16 bits of s10 which aliases the low 32 bits of d10 which aliases the low 64 bits of q10 which aliases the low 128 bits of z10.

My suspicion is that TableGen only creates one regunit for all of these *10 registers. If that's true then there's no way for regunit-based clobbering info to express that d10 is preserved but q10 is not (because its high bits are clobbered). So I think we would have to go back and revert #94608.
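One way to check that suspicion empirically (a debugging sketch, not part of this patch, assuming an AArch64 TargetRegisterInfo TRI is in scope):

auto PrintRegUnits = [&TRI](MCRegister Reg, const char *Name) {
  dbgs() << Name << " regunits:";
  for (MCRegUnitIterator RUI(Reg, &TRI); RUI.isValid(); ++RUI)
    dbgs() << ' ' << unsigned(*RUI);
  dbgs() << '\n';
};
PrintRegUnits(AArch64::D10, "D10");
// If the suspicion is right, this prints the exact same single unit.
PrintRegUnits(AArch64::Q10, "Q10");

As confirmed further down in the thread, b10 through z10 all share one regunit, so a regunit bitmask alone cannot say "D10 preserved, Q10 clobbered".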

Contributor:
> Add a flag to registers to tell TableGen that subregs cover the whole register (EDIT: we already have one)

What is that flag that we already have?

Anyway I think this kind of approach sounds reasonable, but should probably get some wider discussion. Possible alternative ideas:

  • Declare the width of every physical register, so that tablegen can work out by itself whether or not it is completely covered by subregs.
  • Change the AArch64 register definitions to declare some kind of dummy subreg for the high parts, which is only used to force tablegen to generate the extra regunits. The only advantage of this is that we might not have to modify tablegen at all.

Contributor Author:
Opened #96146 to continue the discussion

Contributor:
> Opened #96146 to continue the discussion

Thanks. In the meantime I guess it's OK to commit the current patch to fix the AArch64 breakage, with a comment explaining that it's overly conservative (hence the AMDGPU test regression).

Collaborator:
Hello. Sorry, all the subtleties of RegUnits are not something I know a lot about (and I may be a bit unresponsive today, sorry if I am). Perhaps @efriedma-quic or @TNorthover know more about how it should all work. We would usually revert back to a known-good state or fix quickly, and leaving it broken for any length of time isn't the llvm way of doing things. So I'm all in favour of this fix, thanks.

We have scalable registers in AArch64 aliasing the same registers too, to throw another potential spanner in the works. I don't believe those registers actually get marked as scalable at the moment though. If we go with option 3 then we should try and make sure that works OK too.

Collaborator:
Sorry all, I'm just seeing this thread now.
Apologies if I repeat things that have already been said, I haven't read through the whole history.

> I agree that would be nice but that is not what TableGen currently does. I've just checked and all of b10 h10 s10 d10 q10 z10 have a single regunit number 35.

@jayfoad is right, this is how the regunits work, and the issue with register masks is well known (well, at least @MatzeB and I know :P). We shouldn't use regunits for regmask operands because they are indeed not precise enough, at least until we go with option 2.

(And by the way changing regunits shouldn't change regalloc. I'm curious to see why you @Pierre-vh saw this.)

; CHECK-NEXT: $xzr = SUBSXri $x0, 1, 0, implicit-def $nzcv
; CHECK-NEXT: Bcc 11, %bb.1, implicit $nzcv
; CHECK-NEXT: B %bb.2
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: bb.2:
; CHECK-NEXT: liveins: $q10, $q11
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: $q0 = COPY $q10
; CHECK-NEXT: $q1 = COPY $q11
bb.0:
liveins: $x0, $w1, $x2
B %bb.1

bb.1:
liveins: $x0, $w1, $x2
renamable $q11 = MOVIv4i32 2, 8
BL &memset, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit $x0, implicit $w1, implicit $x2, implicit-def $sp, implicit-def $x0
renamable $q10 = MVNIv4i32 4, 0
$xzr = SUBSXri $x0, 1, 0, implicit-def $nzcv
Bcc 11, %bb.1, implicit $nzcv
B %bb.2

bb.2:
liveins: $q10, $q11
$q0 = COPY $q10
$q1 = COPY $q11
...
4 changes: 2 additions & 2 deletions llvm/test/CodeGen/AMDGPU/indirect-call.ll
@@ -886,12 +886,12 @@ define void @test_indirect_call_vgpr_ptr_inreg_arg(ptr %fptr) {
 ; GCN-NEXT: v_writelane_b32 v40, s62, 30
 ; GCN-NEXT: v_writelane_b32 v40, s63, 31
 ; GCN-NEXT: s_mov_b64 s[6:7], exec
-; GCN-NEXT: s_movk_i32 s4, 0x7b
 ; GCN-NEXT: .LBB6_1: ; =>This Inner Loop Header: Depth=1
 ; GCN-NEXT: v_readfirstlane_b32 s8, v0
 ; GCN-NEXT: v_readfirstlane_b32 s9, v1
 ; GCN-NEXT: v_cmp_eq_u64_e32 vcc, s[8:9], v[0:1]
 ; GCN-NEXT: s_and_saveexec_b64 s[10:11], vcc
+; GCN-NEXT: s_movk_i32 s4, 0x7b
 ; GCN-NEXT: s_swappc_b64 s[30:31], s[8:9]
 ; GCN-NEXT: ; implicit-def: $vgpr0_vgpr1
 ; GCN-NEXT: s_xor_b64 exec, exec, s[10:11]
@@ -980,12 +980,12 @@ define void @test_indirect_call_vgpr_ptr_inreg_arg(ptr %fptr) {
 ; GISEL-NEXT: v_writelane_b32 v40, s62, 30
 ; GISEL-NEXT: v_writelane_b32 v40, s63, 31
 ; GISEL-NEXT: s_mov_b64 s[6:7], exec
-; GISEL-NEXT: s_movk_i32 s4, 0x7b
 ; GISEL-NEXT: .LBB6_1: ; =>This Inner Loop Header: Depth=1
 ; GISEL-NEXT: v_readfirstlane_b32 s8, v0
 ; GISEL-NEXT: v_readfirstlane_b32 s9, v1
 ; GISEL-NEXT: v_cmp_eq_u64_e32 vcc, s[8:9], v[0:1]
 ; GISEL-NEXT: s_and_saveexec_b64 s[10:11], vcc
+; GISEL-NEXT: s_movk_i32 s4, 0x7b
 ; GISEL-NEXT: s_swappc_b64 s[30:31], s[8:9]
 ; GISEL-NEXT: ; implicit-def: $vgpr0
 ; GISEL-NEXT: s_xor_b64 exec, exec, s[10:11]