RuntimeError during CUDA graph capture with FP8 when deploying Llama-4-Maverick on TGI 3.2.3 with H100 GPUs #3175

Open
nskpro-cmd opened this issue Apr 15, 2025 · 1 comment

@nskpro-cmd

Hi team,

I'm hitting a runtime error when deploying the meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 model on tgi-server:3.2.3 with 8 shards and an FP8 KV cache. Even with minimal token settings, the server fails during warmup (while capturing CUDA graphs) with the following error:

RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
...
captures_underway.empty() INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/CUDACachingAllocator.cpp":3085, please report a bug to PyTorch.

Server error: Unexpected <class 'RuntimeError'>: captures_underway.empty() INTERNAL ASSERT FAILED
Error: Backend(Warmup(Generation("Unexpected <class 'RuntimeError'>: captures_underway.empty() INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/CUDACachingAllocator.cpp":3085, please report a bug to PyTorch. ")))
...
Error: WebserverFailed

TGI Server Args:

  • --model-id meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
  • --sharded true
  • --num-shard 8
  • --max-input-tokens 1024
  • --max-batch-prefill-tokens 1024
  • --max-total-tokens 2048
  • --port 8000
  • --usage-stats off
  • --env
  • --kv-cache-dtype fp8_e5m2
  • --disable-custom-kernels

The only workaround that gets the model running is disabling CUDA graphs (i.e., --disable-cuda-graphs), but inference then becomes extremely slow, to the point of being unusable in production.
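For what it's worth, with CUDA graphs disabled the server does come up and serve requests. A minimal smoke test along these lines works (the host, prompt, and token budget below are placeholders, not my exact setup), just with very low decode throughput:

    import requests

    # Hypothetical smoke test against the graph-disabled deployment.
    # Endpoint and payload follow TGI's standard /generate API.
    resp = requests.post(
        "http://localhost:8000/generate",
        json={
            "inputs": "Summarize the benefits of FP8 inference in one sentence.",
            "parameters": {
                # Keep input tokens + max_new_tokens within --max-total-tokens 2048.
                "max_new_tokens": 256,
            },
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["generated_text"])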

The issue seems related to the FP8 KV cache interacting with PyTorch's CUDA graph capture internals. Any guidance or fix would be appreciated.
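To help narrow it down, here is a minimal, hypothetical isolation script (not TGI's actual code path) that mimics an fp8_e5m2 KV-cache write inside a CUDA graph capture on a single GPU. If this already trips the same captures_underway assert on H100 with the PyTorch build shipped in tgi-server:3.2.3, the problem would be on the PyTorch/driver side rather than in TGI:

    import os

    # The trace recommends CUDA_LAUNCH_BLOCKING=1; set it before any CUDA work
    # so the failing kernel is reported synchronously at its call site.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch

    assert torch.cuda.is_available()
    device = torch.device("cuda")

    # Preallocated fp8 cache and an fp16 update, mimicking a KV-cache store.
    kv_cache = torch.empty(16, 64, dtype=torch.float8_e5m2, device=device)
    new_kv = torch.randn(16, 64, dtype=torch.float16, device=device)

    # Warm up the ops on a side stream, as recommended before graph capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        kv_cache.copy_(new_kv.to(torch.float8_e5m2))
    torch.cuda.current_stream().wait_stream(s)

    # Capture the same fp16 -> fp8 cast + copy inside a CUDA graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        kv_cache.copy_(new_kv.to(torch.float8_e5m2))

    graph.replay()
    torch.cuda.synchronize()
    print("fp8 cast/copy captured and replayed successfully")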

Let me know if any additional info (env, GPU setup, torch version, etc.) is needed.

Thanks in advance!

@nskpro-cmd (Author) commented Apr 15, 2025

@danieldk @Narsil could you please look into this?
