[NVPTX] Add intrinsics for wgmma.fence PTX instructions #120523

Merged

90 changes: 90 additions & 0 deletions llvm/docs/NVPTXUsage.rst
@@ -733,6 +733,96 @@ these intrinsics is a boolean flag, with the same functionality as described in
For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.

Warp Group Intrinsics
---------------------

'``llvm.nvvm.wgmma.fence.sync.aligned``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.wgmma.fence.sync.aligned()

Overview:
"""""""""

The '``@llvm.nvvm.wgmma.fence.sync.aligned``' intrinsic generates the
``wgmma.fence.sync.aligned`` PTX instruction, which establishes an ordering
between prior accesses to any warpgroup registers and subsequent accesses to
the same registers by a ``wgmma.mma_async`` instruction.

The ``wgmma.fence`` instruction must be issued by all warps of the warpgroup in
the following locations:

* Before the first ``wgmma.mma_async`` operation in a warpgroup.
* Between a register access by a thread in the warpgroup and any
``wgmma.mma_async`` instruction that accesses the same registers, except when
these are accumulator register accesses across multiple ``wgmma.mma_async``
instructions of the same shape in which case an ordering guarantee is
provided by default.
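
As an illustration, a minimal sketch of issuing the fence before the
warpgroup's first ``wgmma.mma_async`` operation (the function name is a
placeholder, and the ``wgmma.mma_async`` operation itself is only indicated
by a comment since it is not part of this intrinsic):

.. code-block:: llvm

  declare void @llvm.nvvm.wgmma.fence.sync.aligned()

  define void @before_first_mma_async() {
    ; Order prior register accesses before the warpgroup's first wgmma.mma_async.
    call void @llvm.nvvm.wgmma.fence.sync.aligned()
    ; ... wgmma.mma_async operations would be issued here ...
    ret void
  }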

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-fence>`_.

'``llvm.nvvm.wgmma.commit_group.sync.aligned``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.wgmma.commit_group.sync.aligned()

Overview:
"""""""""

The '``@llvm.nvvm.wgmma.commit_group.sync.aligned``' intrinsic generates the
``wgmma.commit_group.sync.aligned`` PTX instruction, which creates a new
wgmma-group per warpgroup and batches all prior ``wgmma.mma_async``
instructions initiated by the executing warp, but not yet committed to any
wgmma-group, into the new wgmma-group. If there are no uncommitted
``wgmma.mma_async`` instructions, ``wgmma.commit_group`` results in an empty
wgmma-group.

An executing thread can wait for the completion of all ``wgmma.mma_async``
operations in a wgmma-group by using ``wgmma.wait_group``.
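
As an illustration, a hypothetical sketch (the function name is a placeholder)
that batches any previously issued, uncommitted ``wgmma.mma_async`` operations
into a new wgmma-group:

.. code-block:: llvm

  declare void @llvm.nvvm.wgmma.commit_group.sync.aligned()

  define void @commit_pending_mma_async() {
    ; ... previously issued wgmma.mma_async operations ...
    ; Batch all uncommitted wgmma.mma_async operations into a new wgmma-group.
    call void @llvm.nvvm.wgmma.commit_group.sync.aligned()
    ret void
  }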

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-commit-group>`_.

'``llvm.nvvm.wgmma.wait_group.sync.aligned``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 immarg N)

Overview:
"""""""""

The '``@llvm.nvvm.wgmma.wait_group.sync.aligned``' intrinsic generates the
``wgmma.wait_group.sync.aligned N`` PTX instruction, which causes the
executing thread to wait until only ``N`` or fewer of the most recent
wgmma-groups are pending and all prior wgmma-groups committed by the
executing thread are complete. For example, when ``N`` is 0, the executing
thread waits for all prior wgmma-groups to complete. Operand ``N`` is an
integer constant.

Accessing the accumulator register or the input register containing the
fragments of matrix A of a ``wgmma.mma_async`` instruction, without first
performing a ``wgmma.wait_group`` instruction that waits on a wgmma-group
including that ``wgmma.mma_async`` instruction, is undefined behavior.
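
As an illustration, a hypothetical sketch (the function name is a placeholder)
that waits for every previously committed wgmma-group before any results are
read back:

.. code-block:: llvm

  declare void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64)

  define void @wait_for_all_groups() {
    ; N = 0: wait until no wgmma-groups committed by this thread remain pending.
    call void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 0)
    ; Accumulator and matrix A input registers may be accessed after this point.
    ret void
  }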

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-wait-group>`_.

Other Intrinsics
----------------

15 changes: 15 additions & 0 deletions llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -4805,6 +4805,21 @@ def int_nvvm_redux_sync_or : ClangBuiltin<"__nvvm_redux_sync_or">,
Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
[IntrConvergent, IntrInaccessibleMemOnly, IntrNoCallback]>;

//
// WGMMA fence instructions
//
// wgmma.fence.sync.aligned;
def int_nvvm_wgmma_fence_sync_aligned
: Intrinsic<[], [], [IntrConvergent]>;

// wgmma.commit_group.sync.aligned;
def int_nvvm_wgmma_commit_group_sync_aligned
: Intrinsic<[], [], [IntrConvergent], "llvm.nvvm.wgmma.commit_group.sync.aligned">;

// wgmma.wait_group.sync.aligned N;
def int_nvvm_wgmma_wait_group_sync_aligned
: Intrinsic<[], [llvm_i64_ty], [IntrConvergent, ImmArg<ArgIndex<0>>], "llvm.nvvm.wgmma.wait_group.sync.aligned">;

//
// WMMA instructions
//
14 changes: 14 additions & 0 deletions llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -7484,4 +7484,18 @@ defm INT_SET_MAXNREG_DEC : SET_MAXNREG<"dec", int_nvvm_setmaxnreg_dec_sync_align

} // isConvergent

//
// WGMMA fence instructions
//
let isConvergent = true in {
def INT_NVVM_WGMMA_FENCE_SYNC_ALIGNED : NVPTXInst<(outs), (ins), "wgmma.fence.sync.aligned;",
[(int_nvvm_wgmma_fence_sync_aligned)]>, Requires<[hasSM90a, hasPTX<80>]>;

def INT_NVVM_WGMMA_COMMIT_GROUP_SYNC_ALIGNED : NVPTXInst<(outs), (ins), "wgmma.commit_group.sync.aligned;",
[(int_nvvm_wgmma_commit_group_sync_aligned)]>, Requires<[hasSM90a, hasPTX<80>]>;

def INT_NVVM_WGMMA_WAIT_GROUP_SYNC_ALIGNED : NVPTXInst<(outs), (ins i64imm:$n), "wgmma.wait_group.sync.aligned \t$n;",
[(int_nvvm_wgmma_wait_group_sync_aligned timm:$n)]>, Requires<[hasSM90a, hasPTX<80>]>;
} // isConvergent = true

def INT_EXIT : NVPTXInst<(outs), (ins), "exit;", [(int_nvvm_exit)]>;
47 changes: 47 additions & 0 deletions llvm/test/CodeGen/NVPTX/wgmma-sm90a-fence.ll
@@ -0,0 +1,47 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc < %s -march=nvptx64 -mcpu=sm_90a -mattr=+ptx80 | FileCheck %s
; RUN: %if ptxas-12.0 %{ llc < %s -march=nvptx64 -mcpu=sm_90a -mattr=+ptx80 | %ptxas-verify -arch=sm_90a %}

target triple = "nvptx64-nvidia-cuda"

declare void @llvm.nvvm.wgmma.fence.sync.aligned()

define void @test_wgmma_fence_sync_aligned() {
; CHECK-LABEL: test_wgmma_fence_sync_aligned(
; CHECK: {
; CHECK-EMPTY:
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
; CHECK-NEXT: wgmma.fence.sync.aligned;
; CHECK-NEXT: ret;
call void @llvm.nvvm.wgmma.fence.sync.aligned()
ret void
}

declare void @llvm.nvvm.wgmma.commit_group.sync.aligned()

define void @test_wgmma_commit_group_sync_aligned() {
; CHECK-LABEL: test_wgmma_commit_group_sync_aligned(
; CHECK: {
; CHECK-EMPTY:
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
; CHECK-NEXT: wgmma.commit_group.sync.aligned;
; CHECK-NEXT: ret;
call void @llvm.nvvm.wgmma.commit_group.sync.aligned()
ret void
}

declare void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64)

define void @test_wgmma_wait_group_sync_aligned() {
; CHECK-LABEL: test_wgmma_wait_group_sync_aligned(
; CHECK: {
; CHECK-EMPTY:
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
; CHECK-NEXT: wgmma.wait_group.sync.aligned 10;
; CHECK-NEXT: ret;
call void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 10)
ret void
}