Skip to content

[NVPTX] Add TMA bulk tensor copy intrinsics #96083

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
137 changes: 137 additions & 0 deletions llvm/docs/NVPTXUsage.rst
Original file line number Diff line number Diff line change
Expand Up @@ -462,6 +462,143 @@ to left-shift the found bit into the most-significant bit position, otherwise
the result is the shift amount needed to right-shift the found bit into the
least-significant bit position. 0xffffffff is returned if no 1 bit is found.

TMA family of Intrinsics
------------------------

'``llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.1d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.2d(..., i32 %d0, i32 %d1, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``' intrinsics
correspond to the ``cp.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
These instructions initiate an asynchronous copy of tensor data from
global memory to shared::cluster memory (indicated by the ``g2s`` prefix)
in ``tile`` mode. In tile mode, the multi-dimensional layout of the
source tensor is preserved at the destination. The dimension of the
tensor data ranges from 1d to 5d with the coordinates specified
by the ``i32 %d0 ... i32 %d4`` arguments.

* The last two arguments to these intrinsics are boolean flags
indicating support for cache_hint and/or multicast modifiers.
These flag arguments must be compile-time constants. The backend
looks through these flags and lowers the intrinsics appropriately.

* The Nth argument (denoted by ``i1 flag_ch``) when set, indicates
a valid cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
variant of the PTX instruction.

* The [N-1]th argument (denoted by ``i1 flag_mc``) when set, indicates
the presence of a multicast mask (``i16 %mc``) and generates the PTX
instruction with the ``.multicast::cluster`` modifier.

For more information, refer PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.

'``llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.[3-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.[3-5]d``' intrinsics
correspond to the ``cp.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
These instructions initiate an asynchronous copy of tensor data from
global memory to shared::cluster memory (indicated by the ``g2s`` prefix)
in ``im2col`` mode. In im2col mode, some dimensions of the source tensor
are unrolled into a single dimensional column at the destination. In this
mode, the tensor has to be at least three-dimensional. Along with the tensor
coordinates, im2col offsets are also specified (denoted by
``i16 im2col0...i16 %im2col2``). The number of im2col offsets is two less
than the number of dimensions of the tensor operation. The last two arguments
to these intrinsics are boolean flags, with the same functionality as described
in the ``tile`` mode intrinsics above.

For more information, refer PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.

'``llvm.nvvm.cp.async.bulk.tensor.s2g.tile.[1-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.2d(..., i32 %d0, i32 %d1, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.tensor.s2g.tile.[1-5]d``' intrinsics
correspond to the ``cp.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
These instructions initiate an asynchronous copy of tensor data from
shared::cta to global memory (indicated by the ``s2g`` prefix)
in ``tile`` mode. The dimension of the tensor data ranges from 1d to 5d
with the coordinates specified by the ``i32 %d0 ... i32 %d4`` arguments.

* The last argument to these intrinsics is a boolean flag
indicating support for cache_hint. This flag argument must
be a compile-time constant. When set, it indicates a valid
cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
variant of the PTX instruction.

For more information, refer PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.

'``llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.[3-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.[1-5]d``' intrinsics
correspond to the ``cp.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
These instructions initiate an asynchronous copy of tensor data from
shared::cta to global memory (indicated by the ``s2g`` prefix)
in ``im2col`` mode. In this mode, the tensor has to be at least
three-dimensional. Unlike the ``g2s`` variants, there are no
im2col_offsets for these intrinsics. The last argument to these
intrinsics is a boolean flag, with the same functionality as
described in the ``s2g.tile`` mode intrinsics above.

For more information, refer PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.

Other Intrinsics
----------------

Expand Down
58 changes: 58 additions & 0 deletions llvm/include/llvm/IR/IntrinsicsNVVM.td
Original file line number Diff line number Diff line change
Expand Up @@ -567,6 +567,52 @@ class SHFL_INFO<bit sync, string mode, string type, bit return_pred> {
[OpType, llvm_i32_ty, llvm_i32_ty]);
}

class CP_ASYNC_BULK_TENSOR_G2S_INTR<int dim, string mode> {
string Name = "int_nvvm_cp_async_bulk_tensor_g2s_" # mode # "_" # dim # "d";

bit IsIm2Col = !if(!eq(mode, "im2col"), 1, 0);
int NumIm2ColOffsets = !if(IsIm2Col, !add(dim, -2), 0);
list<LLVMType> Im2ColOffsetsTy = !listsplat(llvm_i16_ty, NumIm2ColOffsets);
list<LLVMType> TensorDimsTy = !listsplat(llvm_i32_ty, dim);
list<LLVMType> ArgsTy = !listconcat(
[llvm_shared_ptr_ty, // dst_smem_ptr
llvm_shared_ptr_ty, // mbarrier_smem_ptr
llvm_ptr_ty], // tensormap_ptr
TensorDimsTy, // actual tensor dims
Im2ColOffsetsTy, // im2col offsets
[llvm_i16_ty, // cta_mask
llvm_i64_ty, // cache_hint
llvm_i1_ty, // Flag for cta_mask
llvm_i1_ty] // Flag for cache_hint
);

int TempFlagsStartIdx = !add(dim, 5);
int FlagsStartIdx = !add(TempFlagsStartIdx, NumIm2ColOffsets);
list<IntrinsicProperty> IntrProp = [IntrConvergent,
WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<2>>,
NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>, NoCapture<ArgIndex<2>>,
ImmArg<ArgIndex<FlagsStartIdx>>,
ImmArg<ArgIndex<!add(FlagsStartIdx, 1)>>];
}

class CP_ASYNC_BULK_TENSOR_S2G_INTR<int dim, string mode> {
string Name = "int_nvvm_cp_async_bulk_tensor_s2g_" # mode # "_" # dim # "d";

list<LLVMType> TensorDimsTy = !listsplat(llvm_i32_ty, dim);
list<LLVMType> ArgsTy = !listconcat(
[llvm_shared_ptr_ty, // src_smem_ptr
llvm_ptr_ty], // tensormap_ptr
TensorDimsTy, // actual tensor dims
[llvm_i64_ty, // cache_hint
llvm_i1_ty] // Flag for cache_hint
);
int FlagsStartIdx = !add(dim, 3);
list<IntrinsicProperty> IntrProp = [IntrConvergent,
ReadOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>,
NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
ImmArg<ArgIndex<FlagsStartIdx>>];
}

let TargetPrefix = "nvvm" in {
def int_nvvm_prmt : ClangBuiltin<"__nvvm_prmt">,
DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
Expand Down Expand Up @@ -4847,4 +4893,16 @@ def int_nvvm_setmaxnreg_dec_sync_aligned_u32
def int_nvvm_exit : ClangBuiltin<"__nvvm_exit">,
Intrinsic<[], [], [IntrConvergent, IntrInaccessibleMemOnly, IntrNoReturn]>;

// Intrinsics for Tensor Copy using TMA
// G2S -> From Global to Shared memory variants
// S2G -> From Shared to Global memory variants
foreach dim = [1, 2, 3, 4, 5] in {
foreach mode = !if(!ge(dim, 3), ["tile", "im2col"], ["tile"]) in {
foreach g2s = [CP_ASYNC_BULK_TENSOR_G2S_INTR<dim, mode>] in
def g2s.Name : DefaultAttrsIntrinsic<[], g2s.ArgsTy, g2s.IntrProp>;
foreach s2g = [CP_ASYNC_BULK_TENSOR_S2G_INTR<dim, mode>] in
def s2g.Name : DefaultAttrsIntrinsic<[], s2g.ArgsTy, s2g.IntrProp>;
}
}

} // let TargetPrefix = "nvvm"
Loading
Loading