@@ -733,6 +733,96 @@ these intrinsics is a boolean flag, with the same functionality as described in
733
733
For more information, refer PTX ISA
734
734
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor >`_.
735
735
736
+ Warp Group Intrinsics
737
+ ---------------------
738
+
739
+ '``llvm.nvvm.wgmma.fence.sync.aligned ``'
740
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
741
+
742
+ Syntax:
743
+ """""""
744
+
745
+ .. code-block :: llvm
746
+
747
+ declare void @llvm.nvvm.wgmma.fence.sync.aligned()
748
+
749
+ Overview:
750
+ """""""""
751
+
752
+ The '``@llvm.nvvm.wgmma.fence.sync.aligned ``' intrinsic generates the
753
+ ``wgmma.fence.sync.aligned `` PTX instruction, which establishes an ordering
754
+ between prior accesses to any warpgroup registers and subsequent accesses to
755
+ the same registers by a ``wgmma.mma_async `` instruction.
756
+
757
+ The ``wgmma.fence `` instruction must be issued by all warps of the warpgroup in
758
+ the following locations:
759
+
760
+ * Before the first ``wgmma.mma_async `` operation in a warpgroup.
761
+ * Between a register access by a thread in the warpgroup and any
762
+ ``wgmma.mma_async `` instruction that accesses the same registers, except when
763
+ these are accumulator register accesses across multiple ``wgmma.mma_async ``
764
+ instructions of the same shape in which case an ordering guarantee is
765
+ provided by default.
766
+
767
+ For more information, refer PTX ISA
768
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-fence >`_.
769
+
770
+ '``llvm.nvvm.wgmma.commit_group.sync.aligned ``'
771
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
772
+
773
+ Syntax:
774
+ """""""
775
+
776
+ .. code-block :: llvm
777
+
778
+ declare void @llvm.nvvm.wgmma.commit_group.sync.aligned()
779
+
780
+ Overview:
781
+ """""""""
782
+
783
+ The '``@llvm.nvvm.wgmma.commit_group.sync.aligned ``' intrinsic generates the
784
+ ``wgmma.commit_group.sync.aligned `` PTX instruction, which creates a new
785
+ wgmma-group per warpgroup and batches all prior ``wgmma.mma_async ``
786
+ instructions initiated by the executing warp but not committed to any
787
+ wgmma-group into the new wgmma-group. If there are no uncommitted ``wgmma
788
+ mma_async `` instructions then, ``wgmma.commit_group `` results in an empty
789
+ wgmma-group.
790
+
791
+ An executing thread can wait for the completion of all ``wgmma.mma_async ``
792
+ operations in a wgmma-group by using ``wgmma.wait_group ``.
793
+
794
+ For more information, refer PTX ISA
795
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-commit-group >`_.
796
+
797
+ '``llvm.nvvm.wgmma.wait_group.sync.aligned ``'
798
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
799
+
800
+ Syntax:
801
+ """""""
802
+
803
+ .. code-block :: llvm
804
+
805
+ declare void @llvm.nvvm.wgmma.wait_group.sync.aligned(i32 immarg N)
806
+
807
+ Overview:
808
+ """""""""
809
+
810
+ The '``@llvm.nvvm.wgmma.wait_group.sync.aligned ``' intrinsic generates the
811
+ ``wgmma.commit_group.sync.aligned N `` PTX instruction, which will cause the
812
+ executing thread to wait until only ``N `` or fewer of the most recent
813
+ wgmma-groups are pending and all the prior wgmma-groups committed by the
814
+ executing threads are complete. For example, when ``N `` is 0, the executing
815
+ thread waits on all the prior wgmma-groups to complete. Operand ``N `` is an
816
+ integer constant.
817
+
818
+ Accessing the accumulator register or the input register containing the
819
+ fragments of matrix A of a ``wgmma.mma_async `` instruction without first
820
+ performing a ``wgmma.wait_group `` instruction that waits on a wgmma-group
821
+ including that ``wgmma.mma_async `` instruction is undefined behavior.
822
+
823
+ For more information, refer PTX ISA
824
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-wait-group >`_.
825
+
736
826
Other Intrinsics
737
827
----------------
738
828
0 commit comments