RuntimeError during CUDA graph capture with FP8 when deploying Llama-4-Maverick on TGI 3.2.3 with H100 GPUs #3175

Open
nskpro-cmd opened this issue Apr 15, 2025 · 1 comment

@nskpro-cmd

Hi team,

I'm hitting a runtime error when deploying the meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 model on tgi-server:3.2.3 with 8 shards and an FP8 KV cache. Even with minimal token settings, the server fails during warmup (while capturing CUDA graphs) with the following error:

RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
...
captures_underway.empty() INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/CUDACachingAllocator.cpp":3085, please report a bug to PyTorch.

Server error: Unexpected <class 'RuntimeError'>: captures_underway.empty() INTERNAL ASSERT FAILED
Error: Backend(Warmup(Generation("Unexpected <class 'RuntimeError'>: captures_underway.empty() INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/CUDACachingAllocator.cpp":3085, please report a bug to PyTorch. ")))
...
Error: WebserverFailed

TGI Server Args:

  • --model-id meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
  • --sharded true
  • --num-shard 8
  • --max-input-tokens 1024
  • --max-batch-prefill-tokens 1024
  • --max-total-tokens 2048
  • --port 8000
  • --usage-stats off
  • --env
  • --kv-cache-dtype fp8_e5m2
  • --disable-custom-kernels

The only workaround that gets the model running is disabling CUDA graphs (i.e., --disable-cuda-graphs), but inference then becomes extremely slow, to the point of being unusable in production.
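For what it's worth, with CUDA graphs disabled the server does come up and serve requests. A minimal smoke test along these lines works (the host, prompt, and token budget below are placeholders, not my exact setup), just with very low decode throughput:

    import requests

    # Hypothetical smoke test against the graph-disabled deployment.
    # Endpoint and payload follow TGI's standard /generate API.
    resp = requests.post(
        "http://localhost:8000/generate",
        json={
            "inputs": "Summarize the benefits of FP8 inference in one sentence.",
            "parameters": {
                # Keep input tokens + max_new_tokens within --max-total-tokens 2048.
                "max_new_tokens": 256,
            },
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["generated_text"])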

The issue seems related to the FP8 KV cache interacting with PyTorch's CUDA graph capture internals. Any guidance or fix would be appreciated.
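To help narrow it down, here is a minimal, hypothetical isolation script (not TGI's actual code path) that mimics an fp8_e5m2 KV-cache write inside a CUDA graph capture on a single GPU. If this already trips the same captures_underway assert on H100 with the PyTorch build shipped in tgi-server:3.2.3, the problem would be on the PyTorch/driver side rather than in TGI:

    import os

    # The trace recommends CUDA_LAUNCH_BLOCKING=1; set it before any CUDA work
    # so the failing kernel is reported synchronously at its call site.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch

    assert torch.cuda.is_available()
    device = torch.device("cuda")

    # Preallocated fp8 cache and an fp16 update, mimicking a KV-cache store.
    kv_cache = torch.empty(16, 64, dtype=torch.float8_e5m2, device=device)
    new_kv = torch.randn(16, 64, dtype=torch.float16, device=device)

    # Warm up the ops on a side stream, as recommended before graph capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        kv_cache.copy_(new_kv.to(torch.float8_e5m2))
    torch.cuda.current_stream().wait_stream(s)

    # Capture the same fp16 -> fp8 cast + copy inside a CUDA graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        kv_cache.copy_(new_kv.to(torch.float8_e5m2))

    graph.replay()
    torch.cuda.synchronize()
    print("fp8 cast/copy captured and replayed successfully")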

Let me know if any additional info (env, GPU setup, torch version, etc.) is needed.

Thanks in advance!

@nskpro-cmd (Author) commented Apr 15, 2025

@danieldk @Narsil could you please look into this?
