[CodeGen] Use 128bits for LaneBitmask. #111157

Closed

Conversation

sdesmalen-arm
Collaborator

This follows on from the conversation on #109797.

FWIW, I've considered several alternatives to this patch:

(1) Using APInt as storage type rather than 'uint64_t Mask[2]'.

(1) makes the code a bit simpler to read, but APInt by default only allocates inline space for a 64-bit value and otherwise dynamically allocates a larger buffer to represent the wider value. Because we know the value is always 128 bits, this extra dynamic allocation is undesirable. We also rarely need the full power of APInt, since most of the uses are bitwise operations, so I chose to represent the value as a uint64_t array instead, converting to/from APInt only when necessary.
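As a minimal illustration of the trade-off (a sketch; the function name is just for illustration, the 128-bit invariant is the one this patch assumes):

  #include "llvm/ADT/APInt.h"
  #include <cstdint>

  using namespace llvm;

  void storageTradeoff() {
    // <= 64 bits: APInt stores the value inline, no heap allocation.
    APInt Narrow(64, 1);
    // > 64 bits: APInt heap-allocates a multi-word buffer, even though
    // LaneBitmask's width is statically known to be 128.
    APInt Wide(128, 1);
    // A fixed-size array avoids the dynamic allocation entirely.
    uint64_t Mask[2] = {1, 0};
    (void)Narrow; (void)Wide; (void)Mask;
  }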

Because it is inconvenient that increasing the BitWidth applies to all targets, I tried to see if there is a way to make the bitwidth dynamic, by:

(2) Making the BitWidth dynamic per target by passing it to
constructors.
(3) Modification of (2) that describes 'none', 'all' and 'lane' with
an enum which doesn't require a BitWidth, until doing
arithmetic with an explicit bit mask (which does have a BitWidth);
see the sketch below.
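A rough sketch of what (3) might have looked like (entirely hypothetical; LaneBitmaskV3 and its members are illustrative, not a proposed API):

  #include "llvm/ADT/APInt.h"
  #include <utility>

  // Hypothetical: 'none', 'all' and 'lane' carry no BitWidth; only an
  // explicit mask does.
  struct LaneBitmaskV3 {
    enum class Kind { None, All, Lane, Mask };
    Kind K;
    unsigned Lane = 0;  // valid only when K == Kind::Lane
    llvm::APInt Mask;   // valid only when K == Kind::Mask; carries a BitWidth

    static LaneBitmaskV3 none() { return {Kind::None}; }
    static LaneBitmaskV3 all() { return {Kind::All}; }
    static LaneBitmaskV3 lane(unsigned L) { return {Kind::Lane, L}; }
    // Only here does a concrete BitWidth enter the picture.
    static LaneBitmaskV3 mask(llvm::APInt M) {
      return {Kind::Mask, 0, std::move(M)};
    }
    // operator== is where this design runs into trouble, as discussed below.
  };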

Unfortunately, neither of these approaches seems feasible. (2) would require passing the TargetRegisterInfo/MCRegisterInfo to many places where this info is not available but where a LaneBitmask value needs to be instantiated.

Approach (3) leads to other issues, such as the question 'what is the meaning of operator== when one value is a mask and the other is the 'all' enum?'. If we let operator== discard the bitwidth, such that a 64-bit all-true bitmask == LaneBitmask::all() (using the 'all' enum), then we could end up in a situation where:

X == LaneBitmask::all() && Y == LaneBitmask::all()

but X != Y.
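
Concretely, a sketch of that failure mode, reusing the hypothetical LaneBitmaskV3 above and assuming a width-discarding equality:

  #include "llvm/ADT/APInt.h"
  #include <cassert>

  // Hypothetical width-discarding rule: any all-ones mask equals 'all'.
  bool equalsIgnoringWidth(const LaneBitmaskV3 &A, const LaneBitmaskV3 &B) {
    if (A.K == LaneBitmaskV3::Kind::All && B.K == LaneBitmaskV3::Kind::Mask)
      return B.Mask.isAllOnes();
    // ... symmetric and remaining cases elided ...
    return false;
  }

  void transitivityIsBroken() {
    LaneBitmaskV3 X = LaneBitmaskV3::mask(llvm::APInt::getAllOnes(64));
    LaneBitmaskV3 Y = LaneBitmaskV3::mask(llvm::APInt::getAllOnes(128));
    // Both compare equal to all() under the width-discarding rule...
    assert(equalsIgnoringWidth(LaneBitmaskV3::all(), X));
    assert(equalsIgnoringWidth(LaneBitmaskV3::all(), Y));
    // ...yet X and Y are not equal to each other (different widths),
    // so equality is no longer transitive.
    assert(X.Mask.getBitWidth() != Y.Mask.getBitWidth());
  }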

I considered replacing the equality operators with methods that take a RegisterInfo pointer, but the LaneBitmask struct is used in STL containers, which require a plain 'operator==' or 'operator<'. We could work around that by providing custom comparators (that call the method with the TargetInfo pointer), but this gets increasingly hacky, as the sketch below shows.
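A sketch of that workaround; compareLessThan is a hypothetical method, not an existing API:

  #include "llvm/CodeGen/TargetRegisterInfo.h"
  #include "llvm/MC/LaneBitmask.h"
  #include <set>

  // Hypothetical: bool LaneBitmask::compareLessThan(LaneBitmask RHS,
  //                                 const TargetRegisterInfo *TRI) const;
  // Every container of LaneBitmask now needs a stateful comparator:
  struct LaneBitmaskLess {
    const llvm::TargetRegisterInfo *TRI;
    bool operator()(llvm::LaneBitmask A, llvm::LaneBitmask B) const {
      return A.compareLessThan(B, TRI); // hypothetical method
    }
  };

  // ...and the comparator must be threaded through every declaration:
  //   std::set<llvm::LaneBitmask, LaneBitmaskLess> Masks(LaneBitmaskLess{TRI});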

Perhaps just using more bits isn't actually that bad in practice.

@llvmbot
Member

llvmbot commented Oct 4, 2024

@llvm/pr-subscribers-tablegen
@llvm/pr-subscribers-mc
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-backend-aarch64

Author: Sander de Smalen (sdesmalen-arm)


Patch is 20.01 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/111157.diff

12 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/RDFLiveness.h (+2-2)
  • (modified) llvm/include/llvm/CodeGen/RDFRegisters.h (+2-2)
  • (modified) llvm/include/llvm/MC/LaneBitmask.h (+97-49)
  • (modified) llvm/lib/CodeGen/MIRParser/MIParser.cpp (+34-11)
  • (modified) llvm/lib/CodeGen/MIRPrinter.cpp (+8-2)
  • (modified) llvm/lib/CodeGen/RDFRegisters.cpp (+5-5)
  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.h (+4-1)
  • (added) llvm/test/CodeGen/AArch64/lanebitmask.mir (+18)
  • (modified) llvm/unittests/CodeGen/MFCommon.inc (+1-1)
  • (modified) llvm/unittests/MC/CMakeLists.txt (+1-1)
  • (added) llvm/unittests/MC/LaneBitmaskTest.cpp (+68)
  • (modified) llvm/utils/TableGen/RegisterInfoEmitter.cpp (+13-10)
diff --git a/llvm/include/llvm/CodeGen/RDFLiveness.h b/llvm/include/llvm/CodeGen/RDFLiveness.h
index fe1034f9b6f8ef..46ec82e77ba492 100644
--- a/llvm/include/llvm/CodeGen/RDFLiveness.h
+++ b/llvm/include/llvm/CodeGen/RDFLiveness.h
@@ -43,8 +43,8 @@ namespace std {
 
 template <> struct hash<llvm::rdf::detail::NodeRef> {
   std::size_t operator()(llvm::rdf::detail::NodeRef R) const {
-    return std::hash<llvm::rdf::NodeId>{}(R.first) ^
-           std::hash<llvm::LaneBitmask::Type>{}(R.second.getAsInteger());
+    return llvm::hash_value<llvm::rdf::NodeId>(R.first) ^
+           llvm::hash_value(R.second.getAsPair());
   }
 };
 
diff --git a/llvm/include/llvm/CodeGen/RDFRegisters.h b/llvm/include/llvm/CodeGen/RDFRegisters.h
index 7eed0b4e1e7b8f..87c38a215e006e 100644
--- a/llvm/include/llvm/CodeGen/RDFRegisters.h
+++ b/llvm/include/llvm/CodeGen/RDFRegisters.h
@@ -106,8 +106,8 @@ struct RegisterRef {
   }
 
   size_t hash() const {
-    return std::hash<RegisterId>{}(Reg) ^
-           std::hash<LaneBitmask::Type>{}(Mask.getAsInteger());
+    return llvm::hash_value<RegisterId>(Reg) ^
+           llvm::hash_value(Mask.getAsPair());
   }
 
   static constexpr bool isRegId(unsigned Id) {
diff --git a/llvm/include/llvm/MC/LaneBitmask.h b/llvm/include/llvm/MC/LaneBitmask.h
index c06ca7dd5b8fcd..ba22257c650a70 100644
--- a/llvm/include/llvm/MC/LaneBitmask.h
+++ b/llvm/include/llvm/MC/LaneBitmask.h
@@ -29,72 +29,120 @@
 #ifndef LLVM_MC_LANEBITMASK_H
 #define LLVM_MC_LANEBITMASK_H
 
+#include "llvm/ADT/APInt.h"
+#include "llvm/ADT/SmallString.h"
 #include "llvm/Support/Compiler.h"
 #include "llvm/Support/Format.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/Printable.h"
 #include "llvm/Support/raw_ostream.h"
+#include <utility>
 
 namespace llvm {
 
-  struct LaneBitmask {
-    // When changing the underlying type, change the format string as well.
-    using Type = uint64_t;
-    enum : unsigned { BitWidth = 8*sizeof(Type) };
-    constexpr static const char *const FormatStr = "%016llX";
+struct LaneBitmask {
+  static constexpr unsigned int BitWidth = 128;
 
-    constexpr LaneBitmask() = default;
-    explicit constexpr LaneBitmask(Type V) : Mask(V) {}
-
-    constexpr bool operator== (LaneBitmask M) const { return Mask == M.Mask; }
-    constexpr bool operator!= (LaneBitmask M) const { return Mask != M.Mask; }
-    constexpr bool operator< (LaneBitmask M)  const { return Mask < M.Mask; }
-    constexpr bool none() const { return Mask == 0; }
-    constexpr bool any()  const { return Mask != 0; }
-    constexpr bool all()  const { return ~Mask == 0; }
-
-    constexpr LaneBitmask operator~() const {
-      return LaneBitmask(~Mask);
-    }
-    constexpr LaneBitmask operator|(LaneBitmask M) const {
-      return LaneBitmask(Mask | M.Mask);
-    }
-    constexpr LaneBitmask operator&(LaneBitmask M) const {
-      return LaneBitmask(Mask & M.Mask);
-    }
-    LaneBitmask &operator|=(LaneBitmask M) {
-      Mask |= M.Mask;
-      return *this;
-    }
-    LaneBitmask &operator&=(LaneBitmask M) {
-      Mask &= M.Mask;
-      return *this;
+  explicit LaneBitmask(APInt V) {
+    switch (V.getBitWidth()) {
+    case BitWidth:
+      Mask[0] = V.getRawData()[0];
+      Mask[1] = V.getRawData()[1];
+      break;
+    default:
+      llvm_unreachable("Unsupported bitwidth");
     }
+  }
+  constexpr explicit LaneBitmask(uint64_t Lo = 0, uint64_t Hi = 0) : Mask{Lo, Hi} {}
 
-    constexpr Type getAsInteger() const { return Mask; }
+  constexpr bool operator==(LaneBitmask M) const {
+    return Mask[0] == M.Mask[0] && Mask[1] == M.Mask[1];
+  }
+  constexpr bool operator!=(LaneBitmask M) const {
+    return Mask[0] != M.Mask[0] || Mask[1] != M.Mask[1];
+  }
+  constexpr bool operator<(LaneBitmask M) const {
+    return Mask[1] < M.Mask[1] || Mask[0] < M.Mask[0];
+  }
+  constexpr bool none() const { return Mask[0] == 0 && Mask[1] == 0; }
+  constexpr bool any() const { return Mask[0] != 0 || Mask[1] != 0; }
+  constexpr bool all() const { return ~Mask[0] == 0 && ~Mask[1] == 0; }
 
-    unsigned getNumLanes() const { return llvm::popcount(Mask); }
-    unsigned getHighestLane() const {
-      return Log2_64(Mask);
-    }
+  constexpr LaneBitmask operator~() const { return LaneBitmask(~Mask[0], ~Mask[1]); }
+  constexpr LaneBitmask operator|(LaneBitmask M) const {
+    return LaneBitmask(Mask[0] | M.Mask[0], Mask[1] | M.Mask[1]);
+  }
+  constexpr LaneBitmask operator&(LaneBitmask M) const {
+    return LaneBitmask(Mask[0] & M.Mask[0], Mask[1] & M.Mask[1]);
+  }
+  LaneBitmask &operator|=(LaneBitmask M) {
+    Mask[0] |= M.Mask[0];
+    Mask[1] |= M.Mask[1];
+    return *this;
+  }
+  LaneBitmask &operator&=(LaneBitmask M) {
+    Mask[0] &= M.Mask[0];
+    Mask[1] &= M.Mask[1];
+    return *this;
+  }
 
-    static constexpr LaneBitmask getNone() { return LaneBitmask(0); }
-    static constexpr LaneBitmask getAll() { return ~LaneBitmask(0); }
-    static constexpr LaneBitmask getLane(unsigned Lane) {
-      return LaneBitmask(Type(1) << Lane);
-    }
+  APInt getAsAPInt() const { return APInt(BitWidth, {Mask[0], Mask[1]}); }
+  constexpr std::pair<uint64_t, uint64_t> getAsPair() const { return {Mask[0], Mask[1]}; }
 
-  private:
-    Type Mask = 0;
-  };
+  unsigned getNumLanes() const {
+    return Mask[1] ? llvm::popcount(Mask[1]) + llvm::popcount(Mask[0])
+                   : llvm::popcount(Mask[0]);
+  }
+  unsigned getHighestLane() const {
+    return Mask[1] ? Log2_64(Mask[1]) + 64 : Log2_64(Mask[0]);
+  }
 
-  /// Create Printable object to print LaneBitmasks on a \ref raw_ostream.
-  inline Printable PrintLaneMask(LaneBitmask LaneMask) {
-    return Printable([LaneMask](raw_ostream &OS) {
-      OS << format(LaneBitmask::FormatStr, LaneMask.getAsInteger());
-    });
+  static constexpr LaneBitmask getNone() { return LaneBitmask(0, 0); }
+  static constexpr LaneBitmask getAll() { return ~LaneBitmask(0, 0); }
+  static constexpr LaneBitmask getLane(unsigned Lane) {
+    return Lane >= 64 ? LaneBitmask(0, 1ULL << (Lane % 64))
+                      : LaneBitmask(1ULL << Lane, 0);
   }
 
+private:
+  uint64_t Mask[2];
+};
+
+/// Create Printable object to print LaneBitmasks on a \ref raw_ostream.
+/// If \p FormatAsCLiterals is true, it will print the bitmask as
+/// a hexadecimal C literal with zero padding, or a list of such C literals if
+/// the value cannot be represented in 64 bits.
+/// For example (FormatAsCliterals == true)
+///   bitmask '1'       => "0x0000000000000001"
+///   bitmask '1 << 64' => "0x0000000000000000,0x0000000000000001"
+/// (FormatAsCLiterals == false)
+///   bitmask '1'       => "00000000000000000000000000000001"
+///   bitmask '1 << 64' => "00000000000000010000000000000000"
+inline Printable PrintLaneMask(LaneBitmask LaneMask,
+                               bool FormatAsCLiterals = false) {
+  return Printable([LaneMask, FormatAsCLiterals](raw_ostream &OS) {
+    SmallString<64> Buffer;
+    APInt V = LaneMask.getAsAPInt();
+    while (true) {
+      unsigned Bitwidth = FormatAsCLiterals ? 64 : LaneBitmask::BitWidth;
+      APInt VToPrint = V.trunc(Bitwidth);
+
+      Buffer.clear();
+      VToPrint.toString(Buffer, 16, /*Signed=*/false,
+                        /*formatAsCLiteral=*/false);
+      unsigned NumZeroesToPad =
+          (VToPrint.countLeadingZeros() / 4) - VToPrint.isZero();
+      OS << (FormatAsCLiterals ? "0x" : "") << std::string(NumZeroesToPad, '0')
+         << Buffer.str();
+      V = V.lshr(Bitwidth);
+      if (V.getActiveBits())
+        OS << ",";
+      else
+        break;
+    }
+  });
+}
+
 } // end namespace llvm
 
 #endif // LLVM_MC_LANEBITMASK_H
diff --git a/llvm/lib/CodeGen/MIRParser/MIParser.cpp b/llvm/lib/CodeGen/MIRParser/MIParser.cpp
index 27f0a9331a3e3e..6b6b5be910fdc4 100644
--- a/llvm/lib/CodeGen/MIRParser/MIParser.cpp
+++ b/llvm/lib/CodeGen/MIRParser/MIParser.cpp
@@ -870,17 +870,40 @@ bool MIParser::parseBasicBlockLiveins(MachineBasicBlock &MBB) {
     lex();
     LaneBitmask Mask = LaneBitmask::getAll();
     if (consumeIfPresent(MIToken::colon)) {
-      // Parse lane mask.
-      if (Token.isNot(MIToken::IntegerLiteral) &&
-          Token.isNot(MIToken::HexLiteral))
-        return error("expected a lane mask");
-      static_assert(sizeof(LaneBitmask::Type) == sizeof(uint64_t),
-                    "Use correct get-function for lane mask");
-      LaneBitmask::Type V;
-      if (getUint64(V))
-        return error("invalid lane mask value");
-      Mask = LaneBitmask(V);
-      lex();
+      if (consumeIfPresent(MIToken::lparen)) {
+        // We need to parse a list of literals
+        SmallVector<uint64_t, 2> Literals;
+        while (true) {
+          if (Token.isNot(MIToken::HexLiteral))
+            return error("expected a lane mask");
+          APInt V;
+          getHexUint(V);
+          Literals.push_back(V.getZExtValue());
+          // Lex past literal
+          lex();
+          if (Token.is(MIToken::rparen))
+            break;
+          else if (Token.isNot(MIToken::comma))
+            return error("expected a comma");
+          // Lex past comma
+          lex();
+        }
+        // Lex past rparen
+        lex();
+        Mask = LaneBitmask(APInt(LaneBitmask::BitWidth, Literals));
+      } else {
+        // Parse lane mask.
+        APInt V;
+        if (Token.is(MIToken::IntegerLiteral)) {
+          uint64_t UV;
+          if (getUint64(UV))
+            return error("invalid lane mask value");
+          V = APInt(LaneBitmask::BitWidth, UV);
+        } else if (getHexUint(V))
+          return error("expected a lane mask");
+        Mask = LaneBitmask(APInt(LaneBitmask::BitWidth, V.getZExtValue()));
+        lex();
+      }
     }
     MBB.addLiveIn(Reg, Mask);
   } while (consumeIfPresent(MIToken::comma));
diff --git a/llvm/lib/CodeGen/MIRPrinter.cpp b/llvm/lib/CodeGen/MIRPrinter.cpp
index cf6122bce22364..349e3aaa87f72a 100644
--- a/llvm/lib/CodeGen/MIRPrinter.cpp
+++ b/llvm/lib/CodeGen/MIRPrinter.cpp
@@ -732,8 +732,14 @@ void MIPrinter::print(const MachineBasicBlock &MBB) {
         OS << ", ";
       First = false;
       OS << printReg(LI.PhysReg, &TRI);
-      if (!LI.LaneMask.all())
-        OS << ":0x" << PrintLaneMask(LI.LaneMask);
+      if (!LI.LaneMask.all()) {
+        OS << ":";
+        if (LI.LaneMask.getAsAPInt().getActiveBits() <= 64)
+          OS << PrintLaneMask(LI.LaneMask, /*FormatAsCLiterals=*/true);
+        else
+          OS << '(' << PrintLaneMask(LI.LaneMask, /*FormatAsCLiterals=*/true)
+             << ')';
+      }
     }
     OS << "\n";
     HasLineAttributes = true;
diff --git a/llvm/lib/CodeGen/RDFRegisters.cpp b/llvm/lib/CodeGen/RDFRegisters.cpp
index 7ce00a66b3ae6c..3229647379207f 100644
--- a/llvm/lib/CodeGen/RDFRegisters.cpp
+++ b/llvm/lib/CodeGen/RDFRegisters.cpp
@@ -412,11 +412,11 @@ raw_ostream &operator<<(raw_ostream &OS, const PrintLaneMaskShort &P) {
   if (P.Mask.none())
     return OS << ":*none*";
 
-  LaneBitmask::Type Val = P.Mask.getAsInteger();
-  if ((Val & 0xffff) == Val)
-    return OS << ':' << format("%04llX", Val);
-  if ((Val & 0xffffffff) == Val)
-    return OS << ':' << format("%08llX", Val);
+  APInt Val = P.Mask.getAsAPInt();
+  if (Val.getActiveBits() <= 16)
+    return OS << ':' << format("%04llX", Val.getZExtValue());
+  if (Val.getActiveBits() <= 32)
+    return OS << ':' << format("%08llX", Val.getZExtValue());
   return OS << ':' << PrintLaneMask(P.Mask);
 }
 
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.h b/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
index 88d5686720985e..a7796906779594 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
@@ -377,7 +377,10 @@ class SIRegisterInfo final : public AMDGPUGenRegisterInfo {
   static unsigned getNumCoveredRegs(LaneBitmask LM) {
     // The assumption is that every lo16 subreg is an even bit and every hi16
     // is an adjacent odd bit or vice versa.
-    uint64_t Mask = LM.getAsInteger();
+    APInt MaskV = LM.getAsAPInt();
+    assert(MaskV.getActiveBits() <= 64 &&
+           "uint64_t is insufficient to represent lane bitmask operation");
+    uint64_t Mask = MaskV.getZExtValue();
     uint64_t Even = Mask & 0xAAAAAAAAAAAAAAAAULL;
     Mask = (Even >> 1) | Mask;
     uint64_t Odd = Mask & 0x5555555555555555ULL;
diff --git a/llvm/test/CodeGen/AArch64/lanebitmask.mir b/llvm/test/CodeGen/AArch64/lanebitmask.mir
new file mode 100644
index 00000000000000..13178ea5cff4e3
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/lanebitmask.mir
@@ -0,0 +1,18 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -o - %s -mtriple=aarch64 -stop-before=greedy | FileCheck %s
+---
+name:            test_parse_lanebitmask
+tracksRegLiveness: true
+liveins:
+  - { reg: '$h0' }
+  - { reg: '$s1' }
+body:             |
+  bb.0:
+    liveins: $h0:0x0000000000000001, $s1:(0x0000000000000001,0x0000000000000000)
+    ; CHECK-LABEL: name: test_parse_lanebitmask
+    ; CHECK: liveins: $h0:0x0000000000000001, $s1:0x0000000000000001, $h0, $s1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: RET_ReallyLR
+    RET_ReallyLR
+...
+
diff --git a/llvm/unittests/CodeGen/MFCommon.inc b/llvm/unittests/CodeGen/MFCommon.inc
index 5d5720c3162da9..720c5e43586d7e 100644
--- a/llvm/unittests/CodeGen/MFCommon.inc
+++ b/llvm/unittests/CodeGen/MFCommon.inc
@@ -23,7 +23,7 @@ class BogusRegisterInfo : public TargetRegisterInfo {
 public:
   BogusRegisterInfo()
       : TargetRegisterInfo(nullptr, BogusRegisterClasses, BogusRegisterClasses,
-                           nullptr, nullptr, nullptr, LaneBitmask(~0u), nullptr,
+                           nullptr, nullptr, nullptr, LaneBitmask::getAll(), nullptr,
                            nullptr) {
     InitMCRegisterInfo(nullptr, 0, 0, 0, nullptr, 0, nullptr, 0, nullptr,
                        nullptr, nullptr, nullptr, nullptr, 0, nullptr);
diff --git a/llvm/unittests/MC/CMakeLists.txt b/llvm/unittests/MC/CMakeLists.txt
index da8e219113f465..fabf0f8512786a 100644
--- a/llvm/unittests/MC/CMakeLists.txt
+++ b/llvm/unittests/MC/CMakeLists.txt
@@ -17,9 +17,9 @@ add_llvm_unittest(MCTests
   Disassembler.cpp
   DwarfLineTables.cpp
   DwarfLineTableHeaders.cpp
+  LaneBitmaskTest.cpp
   MCInstPrinter.cpp
   StringTableBuilderTest.cpp
   TargetRegistry.cpp
   MCDisassemblerTest.cpp
   )
-
diff --git a/llvm/unittests/MC/LaneBitmaskTest.cpp b/llvm/unittests/MC/LaneBitmaskTest.cpp
new file mode 100644
index 00000000000000..d7762122e03220
--- /dev/null
+++ b/llvm/unittests/MC/LaneBitmaskTest.cpp
@@ -0,0 +1,68 @@
+//===------------------ LaneBitmaskTest.cpp -------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "gtest/gtest.h"
+#include "llvm/MC/LaneBitmask.h"
+#include "llvm/Support/raw_ostream.h"
+#include <string>
+
+using namespace llvm;
+
+TEST(LaneBitmaskTest, Basic) {
+  EXPECT_EQ(LaneBitmask::getAll(), ~LaneBitmask::getNone());
+  EXPECT_EQ(LaneBitmask::getNone(), ~LaneBitmask::getAll());
+  EXPECT_EQ(LaneBitmask::getLane(0) | LaneBitmask::getLane(1), LaneBitmask(3));
+  EXPECT_EQ(LaneBitmask(3) & LaneBitmask::getLane(1), LaneBitmask::getLane(1));
+
+  EXPECT_EQ(LaneBitmask(APInt(128, 42)).getAsAPInt(), APInt(128, 42));
+  EXPECT_EQ(LaneBitmask(3).getNumLanes(), 2);
+  EXPECT_EQ(LaneBitmask::getLane(0).getHighestLane(), 0);
+  EXPECT_EQ(LaneBitmask::getLane(64).getHighestLane(), 64);
+  EXPECT_EQ(LaneBitmask::getLane(127).getHighestLane(), 127);
+
+  EXPECT_LT(LaneBitmask::getLane(64), LaneBitmask::getLane(65));
+  EXPECT_LT(LaneBitmask::getLane(63), LaneBitmask::getLane(64));
+  EXPECT_LT(LaneBitmask::getLane(62), LaneBitmask::getLane(63));
+
+  LaneBitmask X(1);
+  X |= LaneBitmask(2);
+  EXPECT_EQ(X, LaneBitmask(3));
+
+  LaneBitmask Y(3);
+  Y &= LaneBitmask(1);
+  EXPECT_EQ(Y, LaneBitmask(1));
+}
+
+TEST(LaneBitmaskTest, Print) {
+  std::string buffer;
+  raw_string_ostream OS(buffer);
+
+  buffer = "";
+  OS << PrintLaneMask(LaneBitmask::getAll(), /*FormatAsCLiterals=*/true);
+  EXPECT_STREQ(OS.str().data(), "0xFFFFFFFFFFFFFFFF,0xFFFFFFFFFFFFFFFF");
+
+  buffer = "";
+  OS << PrintLaneMask(LaneBitmask::getAll(), /*FormatAsCLiterals=*/false);
+  EXPECT_STREQ(OS.str().data(), "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF");
+
+  buffer = "";
+  OS << PrintLaneMask(LaneBitmask::getLane(0), /*FormatAsCLiterals=*/true);
+  EXPECT_STREQ(OS.str().data(), "0x0000000000000001");
+
+  buffer = "";
+  OS << PrintLaneMask(LaneBitmask::getLane(63), /*FormatAsCLiterals=*/true);
+  EXPECT_STREQ(OS.str().data(), "0x8000000000000000");
+
+  buffer = "";
+  OS << PrintLaneMask(LaneBitmask::getNone(), /*FormatAsCLiterals=*/true);
+  EXPECT_STREQ(OS.str().data(), "0x0000000000000000");
+
+  buffer = "";
+  OS << PrintLaneMask(LaneBitmask::getLane(64), /*FormatAsCLiterals=*/true);
+  EXPECT_STREQ(OS.str().data(), "0x0000000000000000,0x0000000000000001");
+}
diff --git a/llvm/utils/TableGen/RegisterInfoEmitter.cpp b/llvm/utils/TableGen/RegisterInfoEmitter.cpp
index 7d81a83ef2b0a6..bf8aea096c1cf1 100644
--- a/llvm/utils/TableGen/RegisterInfoEmitter.cpp
+++ b/llvm/utils/TableGen/RegisterInfoEmitter.cpp
@@ -653,7 +653,8 @@ static DiffVec &diffEncode(DiffVec &V, unsigned InitVal, Iter Begin, Iter End) {
 static void printDiff16(raw_ostream &OS, int16_t Val) { OS << Val; }
 
 static void printMask(raw_ostream &OS, LaneBitmask Val) {
-  OS << "LaneBitmask(0x" << PrintLaneMask(Val) << ')';
+  OS << "LaneBitmask("
+     << PrintLaneMask(Val, /*FormatAsCLiteral=*/true) << ")";
 }
 
 // Try to combine Idx's compose map into Vec if it is compatible.
@@ -818,11 +819,11 @@ void RegisterInfoEmitter::emitComposeSubRegIndexLaneMask(raw_ostream &OS,
         "  for (const MaskRolOp *Ops =\n"
         "       &LaneMaskComposeSequences[CompositeSequences[IdxA]];\n"
         "       Ops->Mask.any(); ++Ops) {\n"
-        "    LaneBitmask::Type M = LaneMask.getAsInteger() & "
-        "Ops->Mask.getAsInteger();\n"
+        "    APInt M = LaneMask.getAsAPInt() & "
+        "Ops->Mask.getAsAPInt();\n"
         "    if (unsigned S = Ops->RotateLeft)\n"
-        "      Result |= LaneBitmask((M << S) | (M >> (LaneBitmask::BitWidth - "
-        "S)));\n"
+        "      Result |= LaneBitmask(M.shl(S) | M.lshr(LaneBitmask::BitWidth - "
+        "S));\n"
         "    else\n"
         "      Result |= LaneBitmask(M);\n"
         "  }\n"
@@ -840,10 +841,10 @@ void RegisterInfoEmitter::emitComposeSubRegIndexLaneMask(raw_ostream &OS,
         "  for (const MaskRolOp *Ops =\n"
         "       &LaneMaskComposeSequences[CompositeSequences[IdxA]];\n"
         "       Ops->Mask.any(); ++Ops) {\n"
-        "    LaneBitmask::Type M = LaneMask.getAsInteger();\n"
+        "    APInt M = LaneMask.getAsAPInt();\n"
         "    if (unsigned S = Ops->RotateLeft)\n"
-        "      Result |= LaneBitmask((M >> S) | (M << (LaneBitmask::BitWidth - "
-        "S)));\n"
+        "      Result |= LaneBitmask(M.lshr(S) | M.shl(LaneBitmask::BitWidth - "
+        "S));\n"
         "    else\n"
         "      Result |= LaneBitmask(M);\n"
         "  }\n"
@@ -1836,7 +1837,8 @@ void RegisterInfoEmitter::debugDump(raw_ostream &OS) {
     for (unsigned M = 0; M != NumModes; ++M)
       OS << ' ' << getModeName(M) << ':' << RC.RSI.get(M).SpillAlignment;
     OS << " }\n\tNumRegs: " << RC.getMembers().size() << '\n';
-    OS << "\tLaneMask: " << PrintLaneMask(RC.LaneMask) << '\n';
+    OS << "\tLaneMask: {"
+       << PrintLaneMask(RC.LaneMask, /*FormatAsCLiteral=*/true) << "}\n";
     OS << "\tHasDisjunctSubRegs: " << RC.HasDisjunctSubRegs << '\n';
     OS << "\tCoveredBySubRegs: " << RC.CoveredBySubRegs << '\n';
     OS << "\tAllocatable: " << RC.Allocatable << '\n';
@@ -1864,7 +1866,8 @@ void RegisterInfoEmitter::debugDump(raw_ostream &OS) {
 
   for (const CodeGenSubRegIndex &SRI : RegBank.getSubRegIndices()) {
     OS << "SubRegIndex " << SRI.getName() << ":\n";
-    OS << "\tLaneMask: " << PrintLaneMask(SRI.LaneMask) << '\n';
+    OS << "\tLaneMask: {"
+       << PrintLaneMask(SRI.LaneMask, /*FormatAsCLiteral=*/true) << "}\n";
     OS << "\tAllSuperRegsCovered: " << SRI.AllSuperRegsCovered << '\n';
     OS << "\tOffset: {";
     for (unsigned M = 0; M != NumMod...
[truncated]


github-actions bot commented Oct 4, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 206408732bca2ef464732a39c8319d47c8a1dbea 136d33b74a1e18fe1b3675135907c6ceb4b80e15 --extensions h,inc,cpp -- llvm/unittests/MC/LaneBitmaskTest.cpp llvm/include/llvm/CodeGen/RDFLiveness.h llvm/include/llvm/CodeGen/RDFRegisters.h llvm/include/llvm/MC/LaneBitmask.h llvm/lib/CodeGen/MIRParser/MIParser.cpp llvm/lib/CodeGen/MIRPrinter.cpp llvm/lib/CodeGen/RDFRegisters.cpp llvm/lib/Target/AMDGPU/SIRegisterInfo.h llvm/unittests/CodeGen/MFCommon.inc llvm/utils/TableGen/RegisterInfoEmitter.cpp
View the diff from clang-format here.
diff --git a/llvm/include/llvm/MC/LaneBitmask.h b/llvm/include/llvm/MC/LaneBitmask.h
index 746f362f0d..89c0d9fa59 100644
--- a/llvm/include/llvm/MC/LaneBitmask.h
+++ b/llvm/include/llvm/MC/LaneBitmask.h
@@ -53,7 +53,8 @@ struct LaneBitmask {
       llvm_unreachable("Unsupported bitwidth");
     }
   }
-  constexpr explicit LaneBitmask(uint64_t Lo = 0, uint64_t Hi = 0) : Mask{Lo, Hi} {}
+  constexpr explicit LaneBitmask(uint64_t Lo = 0, uint64_t Hi = 0)
+      : Mask{Lo, Hi} {}
 
   constexpr bool operator==(LaneBitmask M) const {
     return Mask[0] == M.Mask[0] && Mask[1] == M.Mask[1];
@@ -68,7 +69,9 @@ struct LaneBitmask {
   constexpr bool any() const { return Mask[0] != 0 || Mask[1] != 0; }
   constexpr bool all() const { return ~Mask[0] == 0 && ~Mask[1] == 0; }
 
-  constexpr LaneBitmask operator~() const { return LaneBitmask(~Mask[0], ~Mask[1]); }
+  constexpr LaneBitmask operator~() const {
+    return LaneBitmask(~Mask[0], ~Mask[1]);
+  }
   constexpr LaneBitmask operator|(LaneBitmask M) const {
     return LaneBitmask(Mask[0] | M.Mask[0], Mask[1] | M.Mask[1]);
   }
@@ -87,7 +90,9 @@ struct LaneBitmask {
   }
 
   APInt getAsAPInt() const { return APInt(BitWidth, {Mask[0], Mask[1]}); }
-  constexpr std::pair<uint64_t, uint64_t> getAsPair() const { return {Mask[0], Mask[1]}; }
+  constexpr std::pair<uint64_t, uint64_t> getAsPair() const {
+    return {Mask[0], Mask[1]};
+  }
 
   unsigned getNumLanes() const {
     return Mask[1] ? llvm::popcount(Mask[1]) + llvm::popcount(Mask[0])
diff --git a/llvm/unittests/CodeGen/MFCommon.inc b/llvm/unittests/CodeGen/MFCommon.inc
index 720c5e4358..583c0281b7 100644
--- a/llvm/unittests/CodeGen/MFCommon.inc
+++ b/llvm/unittests/CodeGen/MFCommon.inc
@@ -23,8 +23,8 @@ class BogusRegisterInfo : public TargetRegisterInfo {
 public:
   BogusRegisterInfo()
       : TargetRegisterInfo(nullptr, BogusRegisterClasses, BogusRegisterClasses,
-                           nullptr, nullptr, nullptr, LaneBitmask::getAll(), nullptr,
-                           nullptr) {
+                           nullptr, nullptr, nullptr, LaneBitmask::getAll(),
+                           nullptr, nullptr) {
     InitMCRegisterInfo(nullptr, 0, 0, 0, nullptr, 0, nullptr, 0, nullptr,
                        nullptr, nullptr, nullptr, nullptr, 0, nullptr);
   }
diff --git a/llvm/unittests/MC/LaneBitmaskTest.cpp b/llvm/unittests/MC/LaneBitmaskTest.cpp
index 4db9ed5a74..1c7c01943d 100644
--- a/llvm/unittests/MC/LaneBitmaskTest.cpp
+++ b/llvm/unittests/MC/LaneBitmaskTest.cpp
@@ -6,9 +6,9 @@
 //
 //===----------------------------------------------------------------------===//
 
-#include "gtest/gtest.h"
 #include "llvm/MC/LaneBitmask.h"
 #include "llvm/Support/raw_ostream.h"
+#include "gtest/gtest.h"
 #include <string>
 
 using namespace llvm;
@@ -28,7 +28,8 @@ TEST(LaneBitmaskTest, Basic) {
   EXPECT_LT(LaneBitmask::getLane(64), LaneBitmask::getLane(65));
   EXPECT_LT(LaneBitmask::getLane(63), LaneBitmask::getLane(64));
   EXPECT_LT(LaneBitmask::getLane(62), LaneBitmask::getLane(63));
-  EXPECT_LT(LaneBitmask::getLane(64), LaneBitmask::getLane(64) | LaneBitmask::getLane(0));
+  EXPECT_LT(LaneBitmask::getLane(64),
+            LaneBitmask::getLane(64) | LaneBitmask::getLane(0));
 
   LaneBitmask X(1);
   X |= LaneBitmask(2);
diff --git a/llvm/utils/TableGen/RegisterInfoEmitter.cpp b/llvm/utils/TableGen/RegisterInfoEmitter.cpp
index bf8aea096c..4a5b05b381 100644
--- a/llvm/utils/TableGen/RegisterInfoEmitter.cpp
+++ b/llvm/utils/TableGen/RegisterInfoEmitter.cpp
@@ -653,8 +653,7 @@ static DiffVec &diffEncode(DiffVec &V, unsigned InitVal, Iter Begin, Iter End) {
 static void printDiff16(raw_ostream &OS, int16_t Val) { OS << Val; }
 
 static void printMask(raw_ostream &OS, LaneBitmask Val) {
-  OS << "LaneBitmask("
-     << PrintLaneMask(Val, /*FormatAsCLiteral=*/true) << ")";
+  OS << "LaneBitmask(" << PrintLaneMask(Val, /*FormatAsCLiteral=*/true) << ")";
 }
 
 // Try to combine Idx's compose map into Vec if it is compatible.

@jayfoad
Contributor

jayfoad commented Oct 4, 2024

Can you test this with compile-time-tracker? Cc @nikic

@arsenm
Contributor

arsenm commented Oct 4, 2024

Can you test this with compile-time-tracker? Cc @nikic

I bet it will give worse results on amdgpu, or other targets that actually enable subreg liveness. We really need an amdgpu compile time tracker

@arsenm
Contributor

arsenm commented Oct 4, 2024

I still don't understand why AArch64 needs so many bits. Having sub-registers that alias does not mean you need additional register units. You should only need one for each physically distinct set of bits, despite differences in access.

return Mask[0] != M.Mask[0] || Mask[1] != M.Mask[1];
}
constexpr bool operator<(LaneBitmask M) const {
return Mask[1] < M.Mask[1] || Mask[0] < M.Mask[0];
Contributor

This isn't right!

Collaborator Author

I had changed this locally but forgot to squash the change!

Type Mask = 0;
};
unsigned getNumLanes() const {
return Mask[1] ? llvm::popcount(Mask[1]) + llvm::popcount(Mask[0])
Contributor

I would hope you could rely on modern host machines having a fast popcount instruction so you don't need to special-case Mask[1] here, but I'm not sure.
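For reference, a branch-free version of what's being suggested here (a sketch with the same signature as the patch's getNumLanes; whether it is faster depends on the host having a hardware popcount):

  unsigned getNumLanes() const {
    // Unconditionally count both words; avoids a data-dependent branch
    // when the host lowers llvm::popcount to a single instruction.
    return llvm::popcount(Mask[0]) + llvm::popcount(Mask[1]);
  }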

};

/// Create Printable object to print LaneBitmasks on a \ref raw_ostream.
/// If \p FormatAsCLiterals is true, it will print the bitmask as
Contributor

Why do we need the option? Please can we just print it as a single hex literal with lots of digits?

Collaborator Author

This is mostly to avoid having to update many tests that rely on a pretty printed format. Changing it would require changing various tests (which I'm happy to do if that's preferred).

@jayfoad
Contributor

jayfoad commented Oct 4, 2024

FYI I once added a similar 128-bit-int-like class here:

Maybe one day it could migrate into ADT, if there is enough commonality.

@nikic
Contributor

nikic commented Oct 4, 2024

Would using std::bitset<128> be an option here?
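
For context, a rough sketch of what that might look like. std::bitset provides the bitwise operators and popcount used below, but no ordering (needed for operator<) and no conversion to/from APInt:

  #include <bitset>
  #include <cstddef>

  using LaneMaskBits = std::bitset<128>;

  void bitsetSketch() {
    LaneMaskBits None;         // all zeros by default
    LaneMaskBits All = ~None;  // all ones
    LaneMaskBits Lane64;
    Lane64.set(64);            // the equivalent of getLane(64)

    LaneMaskBits M = (All & Lane64) | None;  // bitwise ops come for free
    std::size_t NumLanes = M.count();        // popcount
    (void)NumLanes;
  }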

@sdesmalen-arm
Collaborator Author

I still don't understand why AArch64 needs so many bits. Having sub registers that alias does not mean you need additional register units. You should only need one for each physically distinct bits, despite differences in access

I think this is mostly because defining register tuples (2x, 3x and 4x) replicates the regunits. When I define the top bits and do some post-processing of the table in AArch64GenRegisterInfo.inc, I get the following lane masks:

0x00000000000000000001 // bsub
0x00000000000000000002 // bsub_hi
0x00000000000000000004 // dsub_hi
0x00000000000000000008 // hsub_hi
0x00000000000000000010 // psub
0x00000000000000000020 // qsub_hi
0x00000000000000000040 // ssub_hi
0x00000000000000000080 // sub_32, sube64, x8sub_0
0x00000000000000000100 // sube32
0x00000000000000000200 // subo32
0x00000000000000000400 // zasubq0
0x00000000000000000800 // zasubq1
0x00000000000000001000 // zasubd1_then_zasubq0
0x00000000000000002000 // zasubd1_then_zasubq1
0x00000000000000004000 // zasubs1_then_zasubq0
0x00000000000000008000 // zasubs1_then_zasubq1
0x00000000000000010000 // zasubs1_then_zasubd1_then_zasubq0
0x00000000000000020000 // zasubs1_then_zasubd1_then_zasubq1
0x00000000000000040000 // zasubh1_then_zasubq0
0x00000000000000080000 // zasubh1_then_zasubq1
0x00000000000000100000 // zasubh1_then_zasubd1_then_zasubq0
0x00000000000000200000 // zasubh1_then_zasubd1_then_zasubq1
0x00000000000000400000 // zasubh1_then_zasubs1_then_zasubq0
0x00000000000000800000 // zasubh1_then_zasubs1_then_zasubq1
0x00000000000001000000 // zasubh1_then_zasubs1_then_zasubd1_then_zasubq0
0x00000000000002000000 // zasubh1_then_zasubs1_then_zasubd1_then_zasubq1
0x00000000000004000000 // dsub1_then_bsub
0x00000000000008000000 // dsub1_then_bsub_hi
0x00000000000010000000 // dsub1_then_hsub_hi
0x00000000000020000000 // dsub1_then_ssub_hi
0x00000000000040000000 // dsub3_then_bsub
0x00000000000080000000 // dsub3_then_bsub_hi
0x00000000000100000000 // dsub3_then_hsub_hi
0x00000000000200000000 // dsub3_then_ssub_hi
0x00000000000400000000 // dsub2_then_bsub
0x00000000000800000000 // dsub2_then_bsub_hi
0x00000000001000000000 // dsub2_then_hsub_hi
0x00000000002000000000 // dsub2_then_ssub_hi
0x00000000004000000000 // psub1, psub1_then_psub
0x00000000008000000000 // qsub1_then_bsub
0x00000000010000000000 // qsub1_then_bsub_hi
0x00000000020000000000 // qsub1_then_dsub_hi
0x00000000040000000000 // qsub1_then_hsub_hi
0x00000000080000000000 // qsub1_then_ssub_hi
0x00000000100000000000 // qsub3_then_bsub
0x00000000200000000000 // qsub3_then_bsub_hi
0x00000000400000000000 // qsub3_then_dsub_hi
0x00000000800000000000 // qsub3_then_hsub_hi
0x00000001000000000000 // qsub3_then_ssub_hi
0x00000002000000000000 // qsub2_then_bsub
0x00000004000000000000 // qsub2_then_bsub_hi
0x00000008000000000000 // qsub2_then_dsub_hi
0x00000010000000000000 // qsub2_then_hsub_hi
0x00000020000000000000 // qsub2_then_ssub_hi
0x00000040000000000000 // x8sub_7, x8sub_7_then_sub_32
0x00000080000000000000 // x8sub_6, x8sub_6_then_sub_32
0x00000100000000000000 // x8sub_5, x8sub_5_then_sub_32
0x00000200000000000000 // x8sub_4, x8sub_4_then_sub_32
0x00000400000000000000 // x8sub_3, x8sub_3_then_sub_32
0x00000800000000000000 // x8sub_2, x8sub_2_then_sub_32
0x00001000000000000000 // x8sub_1, x8sub_1_then_sub_32
0x00002000000000000000 // subo64, subo64_then_sub_32
0x00004000000000000000 // zsub1_then_bsub
0x00008000000000000000 // zsub1_then_bsub_hi
0x00010000000000000000 // zsub1_then_dsub_hi
0x00020000000000000000 // zsub1_then_hsub_hi
0x00040000000000000000 // zsub1_then_qsub_hi
0x00080000000000000000 // zsub1_then_ssub_hi
0x00100000000000000000 // zsub3_then_bsub
0x00200000000000000000 // zsub3_then_bsub_hi
0x00400000000000000000 // zsub3_then_dsub_hi
0x00800000000000000000 // zsub3_then_hsub_hi
0x01000000000000000000 // zsub3_then_qsub_hi
0x02000000000000000000 // zsub3_then_ssub_hi
0x04000000000000000000 // zsub2_then_bsub
0x08000000000000000000 // zsub2_then_bsub_hi
0x10000000000000000000 // zsub2_then_dsub_hi
0x20000000000000000000 // zsub2_then_hsub_hi
0x40000000000000000000 // zsub2_then_qsub_hi
0x80000000000000000000 // zsub2_then_ssub_hi

Where:

  • zasub => Matrix registers
  • bsub/hsub/ssub/dsub/qsub => FP/vector registers
  • zsub => SVE (scalable data vectors)
  • psub => SVE (scalable predicate vectors)
  • anything else => GPR registers

@arsenm
Contributor

arsenm commented Oct 4, 2024

I think this is mostly because defining register tuples (2x, 3x and 4x) replicates the regunits. When I define the top bits and do some post-processing of the table in AArch64GenRegisterInfo.inc, I get the following lane masks:

This doesn't sound right. AMDGPU nearly exclusively uses register tuples, and we get one regunit per lane (well, one for each 16-bit half of each lane).

@sdesmalen-arm
Collaborator Author

Would using std::bitset<128> be an option here?

Yes, I agree that's better than reinventing. I guess llvm::bitset would be preferred over std::bitset though?

@sdesmalen-arm
Collaborator Author

I think this is mostly because defining register tuples (2x, 3x and 4x) replicates the regunits. When I define the top bits and do some post-processing of the table in AArch64GenRegisterInfo.inc, I get the following lane masks:

This doesn't sound right. AMDGPU nearly exclusively uses register tuples, and we get one regunit per lane (well, one for each 16-bit half of each lane).

I'm not really sure what AMDGPU does that is different or how it encodes the information more efficiently. Are there any lane masks in the table I shared above that you believe use unnecessary regunits?

@arsenm
Contributor

arsenm commented Oct 4, 2024

I'm not really sure what AMDGPU does that is different or how it encodes the information more efficiently. Are there any lane masks in the table I shared above that you believe use unnecessary regunits?

This table seems to have one bit for every subregister index. I would expect overlapping tuples to use multiple bits of mask. e.g. AMDGPU has this:

hi16 :   L0000000000000001 EMPTY
lo16 :   L0000000000000002 EMPTY
sub0 :   L0000000000000003 EMPTY
sub0_sub1 :   L000000000000000F EMPTY
sub0_sub1_sub2 :   L000000000000003F EMPTY
...

explicit LaneBitmask(APInt V) {
switch (V.getBitWidth()) {
case BitWidth:
Mask[0] = V.getRawData()[0];
Collaborator

Probably should use APInt::extractBitsAsZExtValue() instead of exposing APInt's internals.
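
A sketch of the suggested rewrite, keeping the patch's 128-bit assumption (a drop-in variant of the constructor shown in the hunk above):

  explicit LaneBitmask(const APInt &V) {
    assert(V.getBitWidth() == BitWidth && "Unsupported bitwidth");
    // Pull out each 64-bit word without touching APInt's raw storage.
    Mask[0] = V.extractBitsAsZExtValue(64, 0);
    Mask[1] = V.extractBitsAsZExtValue(64, 64);
  }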

@sdesmalen-arm
Collaborator Author

I'm not really sure what AMDGPU does that is different or how it encodes the information more efficiently. Are there any lane masks in the table I shared above that you believe use unnecessary regunits?

This table seems to have one bit for every subregister index. I would expect overlapping tuples to use multiple bits of mask. e.g. AMDGPU has this:

hi16 :   L0000000000000001 EMPTY
lo16 :   L0000000000000002 EMPTY
sub0 :   L0000000000000003 EMPTY
sub0_sub1 :   L000000000000000F EMPTY
sub0_sub1_sub2 :   L000000000000003F EMPTY
...

For the following acronyms:

  • bl = bsub low 8 bits
  • bh = bsub high 8 bits
  • hh = hsub high 16 bits
  • sh = ssub high 32 bits
  • dh = dsub high 64 bits
  • qh = qsub high 128+ bits

Such that:

  • 16-bit subregister 'hsub' <=> bl | bh
  • 32-bit subregister 'ssub' <=> bl | bh | hh
  • ..
  • 128-bit subregister 'qsub' <=> bl | bh | hh | sh | dh
  • 128+ bit subregister 'zsub' <=> bl | bh | hh | sh | dh | qh (z registers are scalable vector registers of 128-bits or more)

I would expect to need at least 6 regunits (qh, dh, sh, hh, bh, bl) to represent all addressable sub-registers in a single 128+ bit reg. At the moment TableGen adds new regunits for tuples, such that for 64-bit D register tuples (DD, DDD, DDDD) it allocates the following regunits:

             sh,hh,bh,bl           sh,hh,bh,bl
  sh,hh,bh,bl           sh,hh,bh,bl
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  16                                         0

For 128-bit Q register tuples (QQ, QQQ, QQQQ) it would then create the following (additional) regunits:

                dh,sh,hh,bh,bl              dh,sh,hh,bh,bl
  dh,sh,hh,bh,bl              dh,sh,hh,bh,bl
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  20                                                     0

And then it would do a similar thing for Z register tuples (ZZ, ZZZ, ZZZZ). In total, this takes up (4 * 4) + (4 * 5) + (4 * 6) = 60 bits, which I agree is unnecessary.

The way I think we want to represent this is as follows:

                          z2,q2,d2,qh,dh,sh,hh,bh,bl                          z0,q0,d0,qh,dh,sh,hh,bh,bl
z3,q3,d3,qh,dh,sh,hh,bh,bl                          z1,q1,d1,qh,dh,sh,hh,bh,bl

Which would only require 4 x 9 = 36 bits in total and would allow representing all tuples.

That said, I'm at my wits' end as to how to represent this in TableGen, because at every turn I run into some TableGen assertion failure. I've tried using ComposedSubRegIndex to no avail. I'm not sure whether I'm hitting TableGen bugs, or whether I'm just not describing this the right way using the existing constructs.

Could you give me some suggestions on what the right way is to represent it?

I do wonder if extending the number of bits for LaneBitmask is such a big problem in practice. I suspect at some point we'll need a wider bitmask anyway, given that all registers for a target share the same encoding space in LaneBitmask. As was pointed out, AMDGPU is already at the limit.

@arsenm
Contributor

arsenm commented Oct 7, 2024

That said, I'm at wits end of how to represent this in TableGen, because with every try and turn I run into some TableGen assertion failure. I've tried using ComposedSubRegIndex to no avail. Not sure if there's some TableGen bugs I'm running into, or whether I'm just not describing this the right way using the existing constructs.

Yes, tablegen isn't the best software

Could you give me some suggestions on what the right way is to represent it?

I think you described it correctly above; it's just a matter of making it work

I do wonder if extending the number of bits for LaneBitmask is such a big problem in practice. I suspect at some point we'll need to have a wider bitmask anyway, given that all registers for a target use the same encoding space in lanebitmask. As was pointed out for AMDGPU it's already at the limit.

This is used in the inner loops of everything involving register liveness, which is already the slowest part in the slowest compiles. I think increasing this should only be done as a last resort.

@sdesmalen-arm
Collaborator Author

Could you give me some suggestions on what the right way is to represent it?

I think above you described it correctly, it's a matter of just making it work

Diving deeper into TableGen it is then!

@sdesmalen-arm
Collaborator Author

I've found the right way to model this now, avoiding the need for more than 64 lanes, so I'll close this PR. Thanks for the feedback!
