Weight gradient kernels for dense and MoE models #95


Merged: 7 commits into main from wgrad-gemm on May 14, 2025
Conversation

@zheanxu (Collaborator) commented May 6, 2025

This pull request introduces `deep_gemm.wgrad_gemm_fp8_fp8_fp32_nt` and `deep_gemm.k_grouped_wgrad_gemm_fp8_fp8_fp32_nt`, optimized weight gradient kernels for dense and MoE models. These kernels achieve a ~20% speedup over the internal CUTLASS implementation.

For detailed usage, refer to the function documentation.
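A minimal usage sketch for the dense kernel is shown below. It assumes a `per_token_cast_to_fp8` helper that mirrors the 1×128-group FP8 quantization used in DeepGEMM's tests (not necessarily identical to it), and takes its shape from the benchmark table further down; the authoritative signature and layout requirements are in the function documentation.

```python
import torch
import deep_gemm


def per_token_cast_to_fp8(t: torch.Tensor):
    # 1x128-group FP8 (E4M3) quantization along the last dim; returns (fp8 tensor, per-group scales).
    # Sketch modeled on DeepGEMM's test utilities, not necessarily identical to them.
    m, n = t.shape
    assert n % 128 == 0
    t_view = t.view(m, -1, 128)
    amax = t_view.abs().float().amax(dim=2).clamp(1e-4)
    t_fp8 = (t_view * (448.0 / amax.unsqueeze(2))).to(torch.float8_e4m3fn).view(m, n)
    return t_fp8, amax / 448.0


m, n, k = 7168, 2112, 4096  # one of the benchmarked dense shapes below
x = torch.randn(m, k, device='cuda', dtype=torch.bfloat16)
y = torch.randn(n, k, device='cuda', dtype=torch.bfloat16)
out = torch.zeros(m, n, device='cuda', dtype=torch.float)  # FP32 output buffer

# Both operands are quantized to FP8 along K; the result is written into the FP32 buffer.
deep_gemm.wgrad_gemm_fp8_fp8_fp32_nt(per_token_cast_to_fp8(x), per_token_cast_to_fp8(y), out)
```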

Weight gradient GEMMs for dense models

| M | N | K | Opt. BM×BN | Computation (TFLOPS) | Memory bandwidth (GB/s) |
|-------|-------|------|------------|----------------------|-------------------------|
| 7168  | 2112  | 4096 | 128x152    | 920  | 507  |
| 1536  | 24576 | 4096 | 128x152    | 986  | 582  |
| 512   | 32768 | 4096 | 128x152    | 878  | 1086 |
| 16384 | 7168  | 4096 | 128x152    | 994  | 342  |
| 7168  | 4096  | 4096 | 128x152    | 942  | 411  |
| 2048  | 7168  | 4096 | 128x152    | 920  | 513  |
| 7168  | 2112  | 8192 | 128x152    | 1052 | 451  |
| 1536  | 24576 | 8192 | 128x152    | 1092 | 511  |
| 512   | 32768 | 8192 | 128x152    | 1014 | 1129 |
| 16384 | 7168  | 8192 | 128x152    | 1079 | 240  |
| 7168  | 4096  | 8192 | 128x152    | 1061 | 333  |
| 2048  | 7168  | 8192 | 128x152    | 1037 | 452  |

Grouped weight gradient GEMMs for MoE models

| Groups | M | N | K | Opt. BM×BN | Computation (TFLOPS) | Memory bandwidth (GB/s) |
|--------|------|------|------|------------|----------------------|-------------------------|
| 4 | 7168 | 4096 | 4096 | 128x152 | 939  | 409 |
| 4 | 2048 | 7168 | 4096 | 128x152 | 900  | 502 |
| 4 | 7168 | 4096 | 8192 | 128x152 | 1044 | 328 |
| 4 | 2048 | 7168 | 8192 | 128x152 | 1033 | 450 |
| 8 | 7168 | 4096 | 4096 | 128x152 | 942  | 411 |
| 8 | 2048 | 7168 | 4096 | 128x152 | 902  | 503 |
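To illustrate the k-grouped semantics, here is a reference computation of what the grouped kernel is expected to produce. This is only an illustration, not the kernel API: it assumes the per-expert token segments are concatenated along the K dimension and that `out` holds one M×N FP32 accumulator per group; consult the function documentation for the actual tensor layout and signature.

```python
import torch


def k_grouped_wgrad_reference(x: torch.Tensor, y: torch.Tensor,
                              out: torch.Tensor, k_sizes) -> torch.Tensor:
    # x: (M, sum(k_sizes)), y: (N, sum(k_sizes)), out: (num_groups, M, N) in FP32.
    # Each group g accumulates x_g @ y_g^T into its own M x N slice.
    k_offset = 0
    for g, k_g in enumerate(k_sizes):
        x_g = x[:, k_offset:k_offset + k_g].float()
        y_g = y[:, k_offset:k_offset + k_g].float()
        out[g] += x_g @ y_g.t()
        k_offset += k_g
    return out
```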

@zheanxu requested a review from LyricZhao on May 6, 2025 at 09:19
@zheanxu self-assigned this on May 6, 2025
@LyricZhao (Collaborator) commented May 6, 2025

I plan to merge it after #94, thanks!

@LyricZhao (Collaborator) commented

> These kernels achieve a ~20% speedup compared to the internal CUTLASS implementation.

To clarify, you can refer to the profile-data repo for the internal CUTLASS implementation's performance numbers used in this comparison.

@hxdtest commented May 13, 2025

```python
import torch
import deep_gemm
# per_token_cast_to_fp8: the 1x128-group FP8 quantization helper (as in DeepGEMM's tests)

m, n, k = 4096, 1024, 4096

x = torch.ones((m, k), device='cuda', dtype=torch.bfloat16)
y = torch.rand((n, k), device='cuda', dtype=torch.bfloat16)

ref_out = x.float() @ y.float().t()
out1 = ref_out.clone()
x_fp8 = per_token_cast_to_fp8(x)
y_fp8 = per_token_cast_to_fp8(y)

deep_gemm.wgrad_gemm_fp8_fp8_fp32_nt(x_fp8, y_fp8, out1)
print(out1 / ref_out)
```

Why is the result nearly twice the reference?

```
out1 / ref_out
tensor([[1.9995, 2.0002, 2.0004,  ..., 2.0006, 1.9999, 2.0009]
```

Update: it should be `out1 = torch.zeros_like(ref_out)`.
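With that fix, a compact version of the check might look like this, reusing the tensors from the snippet above (the tolerances are illustrative, chosen loosely to allow for FP8 quantization error):

```python
out1 = torch.zeros_like(ref_out)  # start from zeros instead of ref_out.clone()
deep_gemm.wgrad_gemm_fp8_fp8_fp32_nt(x_fp8, y_fp8, out1)
torch.testing.assert_close(out1, ref_out, rtol=2e-2, atol=2e-2)
```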

@zheanxu (Collaborator, Author) commented May 13, 2025

@hxdtest Thank you very much for your feedback.
During backpropagation, W needs to accumulate W_grad, so `deep_gemm.wgrad_gemm_fp8_fp8_fp32_nt(x, y, out)` was designed to perform `out += x @ y.t()` instead of `out = x @ y.t()`. This detail was omitted in the documentation.
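In other words, the kernel accumulates in place, which also makes gradient accumulation across micro-batches natural. A small illustration, reusing the FP8-cast tensors from the snippets above:

```python
# Fresh weight gradient: start from a zeroed FP32 buffer.
w_grad = torch.zeros_like(ref_out)
deep_gemm.wgrad_gemm_fp8_fp8_fp32_nt(x_fp8, y_fp8, w_grad)   # w_grad == x @ y.t()

# Accumulation over another micro-batch: reuse the same buffer.
deep_gemm.wgrad_gemm_fp8_fp8_fp32_nt(x_fp8, y_fp8, w_grad)   # w_grad == 2 * (x @ y.t())
```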

@hxdtest commented May 14, 2025

> @hxdtest Thank you very much for your feedback. During backpropagation, W needs to accumulate W_grad, so `deep_gemm.wgrad_gemm_fp8_fp8_fp32_nt(x, y, out)` was designed to perform `out += x @ y.t()` instead of `out = x @ y.t()`. This detail was omitted in the documentation.

Thank you for your reply. After fixing the test code, the results are close.

@LyricZhao merged commit 04278f6 into main on May 14, 2025
@LyricZhao deleted the wgrad-gemm branch on May 14, 2025 at 07:55
@hxdtest commented May 14, 2025

> @hxdtest Thank you very much for your feedback. During backpropagation, W needs to accumulate W_grad, so `deep_gemm.wgrad_gemm_fp8_fp8_fp32_nt(x, y, out)` was designed to perform `out += x @ y.t()` instead of `out = x @ y.t()`. This detail was omitted in the documentation.

Fantastic work! I used DeepGEMM to build an FP8 Linear layer that replaces torch.nn.Linear and ran an RL job. The evaluation scores with mixed FP8 precision appear to be close to those of the mixed BF16 experiment.

@ajWithNucleus commented May 17, 2025

@hxdtest Can you please share your Linear layer wrapper as a quick-start utility? It would be helpful.

@hxdtest commented May 29, 2025

> @hxdtest Can you please share your Linear layer wrapper as a quick-start utility? It would be helpful.

https://github.com/hxdtest/fp8_verl/blob/add_fp8/verl/third_party/deep_gemm/fp8_linear.py

@LyricZhao (Collaborator) commented
Thanks for your work on the FP8 linear module. However, the implementation has many unfused kernels, e.g. per_token_cast_to_fp8 and the kernels launched inside the DeepGEMM function call, which may lead to lower overall performance.

Just a reminder, in case you care about end-to-end performance :)
