Skip to content

[AutoBump] Merge with b4e81fd1 (Jan 24) (20) #558

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 196 commits into
base: bump_to_8388040f
Choose a base branch
from

Conversation

jorickert
Copy link

No description provided.

asb and others added 30 commits January 23, 2025 10:22
…RISC-V failure (llvm#123959)

I'm not sure why the test is larger for RISC-V than other targets, but
we saw this before with llvm#111360.

The file is just over the current 60KB limit:

```
62772 /home/asb/llvm-project/build/stage2/tools/clang/test/Modules/Output/empty.modulemap.tmp/base.pcm
```
…123648)

Summary:
This PR fixes bugreport
llvm#122493 The root problem is
the same as before lambda function and DeclRefExpr references a variable
that does not belong to the same module as the enclosing function body.
Therefore iteration over the function body doesn’t visit the VarDecl.
Before this change RelatedDeclsMap was created only for canonical decl
but in reality it has to be done for the definition of the function that
does not always match the canonical decl.

Test Plan: check-clang
Intrinsics are available for the 'cpSize'
variants also. So, this patch migrates the Op
to lower to the intrinsics for all cases.

* Update the existing tests to check the lowering to intrinsics.
* Add newer cp_async_zfill tests to verify the lowering for the 'cpSize'
   variants.
* Tidy-up CHECK lines in cp_async() function in nvvmir.mlir (NFC)

PTX spec link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async

Signed-off-by: Durgadoss R <[email protected]>
…123967)

We use variable locations such as DBG_VALUE $xmm0 as shorthand to refer
to "the low lane of $xmm0", and this is reflected in how DWARF is
interpreted too. However InstrRefBasedLDV tries to be smart and
interprets such a DBG_VALUE as a 128-bit reference. We then issue a
DW_OP_deref_size of 128 bits to the stack, which isn't permitted by
DWARF (it's larger than a pointer).

Solve this for now by not using DW_OP_deref_size if it would be illegal.
Instead we'll use DW_OP_deref, and the consumer will load the variable
type from the stack, which should be correct.

There's still a risk of imprecision when LLVM decides to use smaller or
larger value types than the source-variable type, which manifests as
too-little or too-much memory being read from the stack. However we
can't solve that without putting more type information in debug-info.

fixes llvm#64093
… object parameters (llvm#124096)

LLDB deduces the CV-qualifiers and storage class of a C++ method from
the object parameter. Currently it assumes that parameter is implicit
(and is a pointer type with the name "this"). This isn't true anymore in
C++23 with explicit object parameters. To support those we can simply
check the `DW_AT_object_pointer` of the subprogram DIE (works for both
declarations and definitions) when searching for the object parameter.

We can also remove the check for `eEncodingIsPointerUID`, because in C++
an artificial parameter called `this` is only ever the implicit object
parameter (at least for all the major compilers).
This patch adds NVVM intrinsics and NVPTX codegen for:

- cp.async.bulk.prefetch.L2.* variants 
- These intrinsics optionally support cache_hints as indicated by the
   boolean flag argument.
- Lit tests are added for all combinations of these intrinsics in
   cp-async-bulk.ll.
- The generated PTX is verified with a 12.3 ptxas executable.
- Added docs for these intrinsics in NVPTXUsage.rst file.

PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch


Co-authored-by: abmajumder <[email protected]>
…explicit object parameters" (llvm#124100)

Reverts llvm#124096

Broke linux CI:
```
Note: This is test shard 7 of 42.
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DWARFASTParserClangTests
[ RUN      ] DWARFASTParserClangTests.TestParseSubroutine_ExplicitObjectParameter
Expected<T> must be checked before access or destruction.
Expected<T> value was in success state. (Note: Expected<T> values in success mode must still be checked prior to being destroyed).
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  SymbolFileDWARFTests 0x0000560271ee5ba7
1  SymbolFileDWARFTests 0x0000560271ee3a2c
2  SymbolFileDWARFTests 0x0000560271ee63ea
3  libc.so.6            0x00007f3e54e5b050
4  libc.so.6            0x00007f3e54ea9e2c
5  libc.so.6            0x00007f3e54e5afb2 gsignal + 18
6  libc.so.6            0x00007f3e54e45472 abort + 211
7  SymbolFileDWARFTests 0x0000560271e79d51
8  SymbolFileDWARFTests 0x0000560271e724f7
9  SymbolFileDWARFTests 0x0000560271f39e2c
10 SymbolFileDWARFTests 0x0000560271f3b368
11 SymbolFileDWARFTests 0x0000560271f3c053
12 SymbolFileDWARFTests 0x0000560271f4cf67
13 SymbolFileDWARFTests 0x0000560271f4c18a
14 SymbolFileDWARFTests 0x0000560271f2561c
15 libc.so.6            0x00007f3e54e4624a
16 libc.so.6            0x00007f3e54e46305 __libc_start_main + 133
17 SymbolFileDWARFTests 0x0000560271e65161
```
Nothing in VPlan.h directly uses VPBlockUtils.h. Move it out to the more
appropriate VPlanUtils.h to reduce the size of the widely included VPlan.h.
…vm#124101)

Reverts llvm#123393

This is causing `TestVectorOfVectorsFromStdModule.py` to fail on the the
macOS clang-15 matrix bot.
Prevents avoidable memory leaks.

Looks like exchange added in aa1333a
didn't take "continue" into account.

```
==llc==2150782==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 10 byte(s) in 1 object(s) allocated from:
    #0 0x5f1b0f9ac14a in strdup llvm-project/compiler-rt/lib/asan/asan_interceptors.cpp:593:3
    #1 0x5f1b1768428d in FileToRemoveList llvm-project/llvm/lib/Support/Unix/Signals.inc:105:55
```
…explicit object parameters" (llvm#124100)"

This reverts commit a802093.

Relands original commit but fixing the unit-test to consume the
`llvm::Expected` error object.
Using a "random" name for an "anonymous" pipe seems to be the state of
the art on windows (according to stack overflow, new windows versions
may have something better, but it involves calling kernel APIs directly
and generally a lot of dark magic).

The problem with the current method was that is does not produce unique
names if one has two copies of the pipe code in the same process, which
is what happened with llvm#120457 (because liblldb only exposes the public
api, and we've started using the pipe code in lldb-dap as well).

This patch works around the problem by adding the address of the counter
variable to the pipe name.

Replicating the multiple-copies setup in a test would be very difficult,
which is why I'm not adding a test for this scenario.
.. by changing the signal stop reason format 🤦

The reason this did not work is because the code in
`StopInfo::GetCrashingDereference` was looking for the string "address="
to extract the address of the crash. Macos stop reason strings have the
form
```
  EXC_BAD_ACCESS (code=1, address=0xdead)
```
while on linux they look like:
```
  signal SIGSEGV: address not mapped to object (fault address: 0xdead)
```

Extracting the address from a string sounds like a bad idea, but I
suppose there's some value in using a consistent format across
platforms, so this patch changes the signal format to use the equals
sign as well. All of the diagnose tests pass except one, which appears
to fail due to something similar llvm#115453 (disassembler reports
unrelocated call targets).

I've left the tests disabled on windows, as the stop reason reporting
code works very differently there, and I suspect it won't work out of
the box. If I'm wrong -- the XFAIL will let us know.
…pecializations (llvm#118167)

Some templates in the standard library are illegal to specialize for users
(even if the specialization contains user-defined types). The [[clang::no_specializations]]
attribute allows marking such base templates so that the compiler will
diagnose if users try adding a specialization.
There were two implementations of this - one that implemented nextafter
in software, and another that called a clang builtin. No in-tree targets
called the builtin, so all targets build the software version. The
builtin version has been removed, and the software version has been
renamed to be the "default".

This commit also optimizes nextafter, to avoid scalarization as much as
possible. Note however that the (CLC) relational builtins still
scalarize; those will be optimized in a separate commit.

Since nextafter is used by some convert_type builtins, the diff to IR
codegen is not limited to the builtin itself.
The code that checks a predicate against a swapped predicate in
isImpliedCondBalancedTypes is not covered by any existing test, within
any Analysis or Transform. Fix this by adding a test to SCEV.
…2782)

Support reductions in SCFToGPU: `scf.parallel` and `scf.reduce` op
combination is now converted to a `gpu.all_reduce` op.
Intel docs have been updated to be similar to AMD and now describe
BSF/BSR as not changing the destination register if the input value was
zero, which allows us to support CTTZ/CTLZ zero-input cases by setting
the destination to support a NumBits result (BSR is a bit messy as it
has to be XOR'd to create a CTLZ result). VIA/Zhaoxin x86_64 CPUs have also
been confirmed to match this behaviour.

This patch adjusts the X86ISD::BSF/BSR nodes to take a "pass through"
argument for zero-input cases, by default this is set to UNDEF to match
existing behaviour, but it can be set to a suitable value if supported.

There are still some limits to this - its only supported for x86_64
capable processors (and I've only enabled it for x86_64 codegen), and
Intel CPUs sometimes zero the upper 32-bits of a pass through register
when used for BSR32/BSF32 with a zero source value (i.e. the whole
64bits may not get passed through).

Fixes llvm#122004
Split target under LLVMIR/Transforms to avoid deps loop.
This code was using a pre-move-semantics trick of using std::swap to
avoid expensive vector copies.
…123942)

This is meant as a short-term workaround for an invalid conversion in
this pass that occurs because existing SDWA selections are not correctly
taken into account during the conversion.

See the draft PR llvm#123221 for an attempt to fix the actual issue.

---------

Co-authored-by: Frederik Harwath <[email protected]>
This header has been removed in C++20 and causes a large amount of
deprecation spam when building against libstdc++ 15 in C++17 mode.

As far as I understand, we just need to include *some* STL header to get
access to the version macros, and as this header also includes
<optional> nowadays we can just drop the <cstd646> include entirely.
…lvm#124089)

Fixes llvm#123800

Extends LDS lowering by allowing it to discover transitive
indirect/escpaing references to LDS GVs.

For example, given the following input:
```llvm
@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8

%store_type = type { i32, ptr }
@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8

define amdgpu_kernel void @offloading_kernel() {
  store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
  call void @call_unknown()
  ret void
}

define void @call_unknown() {
  %1 = alloca ptr, align 8
  %2 = call i32 %1()
  ret void
}

define void @indirectly_load_lds() {
  call void @directly_load_lds()
  ret void
}

define void @directly_load_lds() {
  %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
  ret void
}

```

With the above input, prior to this patch, LDS lowering failed to lower
the reference to `@lds_item_to_indirectly_load` because:
1. it is indirectly called by a function whose address is taken in the
kernel.
2. we did not check if the kernel indirectly makes any calls to unknown
functions (we only checked the direct calls).

Co-authored-by: Jon Chesterfield <[email protected]>
…23684)

For VALU shuffles, this saves an instruction in some case.
…vm#124116)

Summary:
We pass the `-nvptx-lower-global-ctor-dtor` option to support the `libc`
like use-case which needs global constructors sometimes. This only
affects the backend. If the NVPTX target is not enabled this option will
be unknown which prevents you from compiling generic IR for this.
RKSimon and others added 30 commits January 24, 2025 09:12
…etCastInstrCost without argument data

We don't use the IntrinsicCostAttributes arguments so, which allows us to use in type-only analysis in a future patch.
Note that PointerUnion::dyn_cast has been soft deprecated in
PointerUnion.h:

  // FIXME: Replace the uses of is(), get() and dyn_cast() with
  //        isa<T>, cast<T> and the llvm::dyn_cast<T>

Literal migration would result in dyn_cast_if_present (see the
definition of PointerUnion::dyn_cast), but this patch uses dyn_cast
because we expect Source to be nonnull.
Note that PointerUnion::dyn_cast has been soft deprecated in
PointerUnion.h:

  // FIXME: Replace the uses of is(), get() and dyn_cast() with
  //        isa<T>, cast<T> and the llvm::dyn_cast<T>

Literal migration would result in dyn_cast_if_present (see the
definition of PointerUnion::dyn_cast), but this patch uses dyn_cast
because we expect UPP.first to be nonnull.
)

TySan supports some preprocessor checks and ignorelists, but they are
currently untested. This PR adds some tests to make sure they all work.

@fhahn @AaronBallman, this is based off the discussion in the
documentation PR [llvm#123595]
llvm#124154)

This makes it more clear what you the author must do, and what reviewers
can expect you to do, before an approved PR can be merged. Spliting out
the email bit into a section also means we can link directly to it in
discussions.

This relies on one of those parties actually reading this, but I plan to
tackle the case where they don't with some new automation.
…os (llvm#123755)

The FORM_TRANSPOSED_REG_TUPLE pseudo nodes use either the ZPR2Mul2
or ZPR4Mul4 register classes for output. This patch changes the class
so that these can be extended to other multi-vector intrinsics which
instead create a ZPR2/ZPR4 register sequence.
VBMI2 CPUs actually have vector funnel shift instruction support
…#120338)

This patch align the debug location of the widen-phi to the debug
location of original phi.

Split from: llvm#120054
Most of this is mis-compiling with +fullfp16 and should be disabled for GISel.
Assign register banks to virtual registers. Does not use generic
RegBankSelect. After register bank selection all register operand of
G_ instructions have LLT and register banks exclusively. If they had
register class, reassign appropriate register bank.

Assign register banks using machine uniformity analysis:
Sgpr - uniform values and some lane masks
Vgpr - divergent, non S1, values
Vcc  - divergent S1 values(lane masks)

AMDGPURegBankSelect does not consider available instructions and, in
some cases, G_ instructions with some register bank assignment can't be
inst-selected. This is solved in RegBankLegalize.

Exceptions when uniformity analysis does not work:
S32/S64 lane masks:
- need to end up with sgpr register class after instruction selection
- In most cases Uniformity analysis declares them as uniform
  (forced by tablegen) resulting in sgpr S32/S64 reg bank
- When Uniformity analysis declares them as divergent (some phis),
  use intrinsic lane mask analyzer to still assign sgpr register bank
temporal divergence copy:
- COPY to vgpr with implicit use of $exec inside of the cycle
- this copy is declared as uniform by uniformity analysis
- make sure that assigned bank is vgpr
Note: uniformity analysis does not consider that registers with vgpr def
are divergent (you can have uniform value in vgpr).
- TODO: implicit use of $exec could be implemented as indicator
  that instruction is divergent
…m#123900)

... for the dynamic blocks created for operator new calls. This way we
get the type of memory allocated right. As a side-effect, the
diagnostics now point to the std::allocator calls, which is an
improvement.
)

As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago; but to ensure some type safety however, we'd like to
have all calls to moveBefore use iterators.

This patch adds a (guaranteed dereferenceable) iterator-taking
moveBefore, and changes a bunch of call-sites where it's obviously safe
to change to use it by just calling getIterator() on an instruction
pointer. A follow-up patch will contain less-obviously-safe changes.

We'll eventually deprecate and remove the instruction-pointer
insertBefore, but not before adding concise documentation of what
considerations are needed (very few).
When generating `arm_neon.h`, NeonEmitter outputs code that
violates strict aliasing rules (C23 6.5 Expressions #7,
C++23 7.2.1 Value category [basic.lval] #11), for example:

    bfloat16_t __reint = __p0;
    uint32_t __reint1 = (uint32_t)(*(uint16_t *) &__reint) << 16;
    __ret = *(float32_t *) &__reint1;

This patch fixed the offending code by replacing it with
a call to `__builtin_bit_cast`.
This patch adds SM and PTX versions for SM
101, 120 and their arch-accelerated variants.

All these are supported in cuda-12.8.
sm120/120a requires ptx8.7 and the rest require ptx8.6.

Signed-off-by: Durgadoss R <[email protected]>
Lower G_ instructions that can't be inst-selected with register bank
assignment from AMDGPURegBankSelect based on uniformity analysis.
- Lower instruction to perform it on assigned register bank
- Put uniform value in vgpr because SALU instruction is not available
- Execute divergent instruction in SALU - "waterfall loop"

Given LLTs on all operands after legalizer, some register bank
assignments require lowering while other do not.
Note: cases where all register bank assignments would require lowering
are lowered in legalizer.

AMDGPURegBankLegalize goals:
- Define Rules: when and how to perform lowering
- Goal of defining Rules it to provide high level table-like brief
  overview of how to lower generic instructions based on available
  target features and uniformity info (uniform vs divergent).
- Fast search of Rules, depends on how complicated Rule.Predicate is
- For some opcodes there would be too many Rules that are essentially
  all the same just for different combinations of types and banks.
  Write custom function that handles all cases.
- Rules are made from enum IDs that correspond to each operand.
  Names of IDs are meant to give brief description what lowering does
  for each operand or the whole instruction.
- AMDGPURegBankLegalizeHelper implements lowering algorithms

Since this is the first patch that actually enables -new-reg-bank-select
here is the summary of regression tests that were added earlier:
- if instruction is uniform always select SALU instruction if available
- eliminate back to back vgpr to sgpr to vgpr copies of uniform values
- fast rules: small differences for standard and vector instruction
- enabling Rule based on target feature - salu_float
- how to specify lowering algorithm - vgpr S64 AND to S32
- on G_TRUNC in reg, it is up to user to deal with truncated bits
  G_TRUNC in reg is treated as no-op.
- dealing with truncated high bits - ABS S16 to S32
- sgpr S1 phi lowering
- new opcodes for vcc-to-scc and scc-to-vcc copies
- lowering for vgprS1-to-vcc copy (formally this is vgpr-to-vcc G_TRUNC)
- S1 zext and sext lowering to select
- uniform and divergent S1 AND(OR and XOR) lowering - inst-selected into
  SALU instruction
- divergent phi with uniform inputs
- divergent instruction with temporal divergent use, source instruction
  is defined as uniform(AMDGPURegBankSelect) - missing temporal
  divergence lowering
- uniform phi, because of undef incoming, is assigned to vgpr. Will be
  fixed in AMDGPURegBankSelect via another fix in machine uniformity
  analysis.
…vm#117939)

Canonicalize gathers/scatters with contiguous (i.e. [0, 1, 2, ...])
offsets into vector masked load/store ops.
…#123958)

`TimerGroup` don't need to use as field of `ClangTidyProfiling`.
We can construct it local during destructing.
…23454)

skip header file before register AST Matchers
it can avoid to matcher lots of ast node when lint header file
Add IDs for bit width that cover multiple LLTs: B32 B64 etc.
"Predicate" wrapper class for bool predicate functions used to
write pretty rules. Predicates can be combined using &&, || and !.
Lowering for splitting and widening loads.
Write rules for loads to not change existing mir tests from old
regbankselect.
…l coroutine clones (llvm#118628)

Summary:
CoroCloner, by calling into CloneFunctionInto, does a lot of repeated
work priming DIFinder and building a list of common module-level debug
info metadata. For programs compiled with full debug info this can get
very expensive.

This diff builds the data once and shares it between all clones.

Anecdata for a sample cpp source file compiled with full debug info:

|                 | Baseline | IdentityMD set | Prebuilt CommonDI (cur.) |
|-----------------|----------|----------------|--------------------------|
| CoroSplitPass   | 306ms    | 221ms          | 68ms                     |
| CoroCloner      | 101ms    | 72ms           | 0.5ms                    |
| CollectCommonDI | -        | -              | 63ms                     |
| Speed up        | 1x       | 1.4x           | 4.5x                     |

Note that CollectCommonDebugInfo happens once *per coroutine* rather than per clone.

Test Plan:
ninja check-llvm-unit
ninja check-llvm

Compiled a sample internal source file, checked time trace output for scope timings.
…2866)

Change existing code for G_PHI to match what LLVM-IR version is doing
via PHINode::hasConstantOrUndefValue. This is not safe for regular PHI
since it may appear with an undef operand and getVRegDef can fail.
Most notably this improves number of values that can be allocated
to sgpr in AMDGPURegBankSelect.
Common case here are phis that appear in structurize-cfg lowering
for cycles with multiple exits:
Undef incoming value is coming from block that reached cycle exit
condition, if other incoming is uniform keep the phi uniform despite
the fact it is joining values from pair of blocks that are entered
via divergent condition branch.
This is the behavior expected by DWARF. It also requires some fixups to
algorithms which were storing the addresses of some objects (Blocks and
Variables) relative to the beginning of the function.

There are plenty of things that still don't work in this setups, but
this change is sufficient for the expression evaluator to correctly
recognize the entry point of a function in this case.
…llvm#123745)

Add the following workflows:

- `fullbuild` on aarch64 ubuntu
- `overlay` on windows 2025
- `overlay` on aarch64 ubuntu

`ccache` variant is used on `aarch64` due to
hendrikmuhs/ccache-action#279
…ot (llvm#121463)

In function handleMFLOSlot, we may get a variable LastInstInFunction
with a value of true from function getNextMachineInstr and IInSlot may
be null which would trigger an assert.
So we need to skip this case.

Fix llvm#118223.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.