I'm facing a runtime issue when deploying the meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 model on tgi-server:3.2.3 with 8 shards and an FP8 configuration. Even with minimal token settings, the model fails during warmup with the following error:
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
...
captures_underway.empty() INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/CUDACachingAllocator.cpp":3085, please report a bug to PyTorch.
Server error: Unexpected <class 'RuntimeError'>: captures_underway.empty() INTERNAL ASSERT FAILED
Error: Backend(Warmup(Generation("Unexpected <class 'RuntimeError'>: captures_underway.empty() INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/CUDACachingAllocator.cpp":3085, please report a bug to PyTorch. ")))
...
Error: WebserverFailed
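For context, TGI's warmup pre-captures the decode step as CUDA graphs (as far as I can tell), and the first error above is the generic message CUDA reports when any kernel fails inside a capture region. Below is a minimal, standalone sketch of that capture/replay pattern, using a stand-in nn.Linear rather than actual TGI code:

```python
import torch

# Minimal, self-contained sketch (not TGI code) of the torch.cuda.CUDAGraph
# capture/replay pattern that TGI's warmup appears to rely on.
# A stand-in nn.Linear plays the role of the model's decode step.
model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.float16)
static_input = torch.randn(8, 4096, device="cuda", dtype=torch.float16)

# Eager warmup outside the graph so lazy allocations and kernel selection happen first.
with torch.no_grad():
    model(static_input)
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    # Any kernel that errors inside this capture region surfaces as
    # "operation failed due to a previous error during capture"; the caching
    # allocator can then trip the captures_underway assert during cleanup.
    static_output = model(static_input)

# Replay path: refill the static buffers and replay the captured graph.
static_input.copy_(torch.randn_like(static_input))
g.replay()
torch.cuda.synchronize()
```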
The only workaround that gets the model running is disabling CUDA graphs (i.e., --disable-cuda-graphs), but inference then becomes extremely slow, practically unusable in production.
The issue seems related to the FP8 path or to PyTorch's CUDA graph capture internals. Any guidance or fix would be appreciated.
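To help narrow down whether the FP8 path specifically breaks under capture, here is a rough isolation test I would try. It assumes the private torch._scaled_mm FP8 GEMM (whose signature varies across PyTorch releases) and an FP8-capable GPU (H100/Ada or newer), so treat it as a sketch rather than an exact repro of TGI's kernels:

```python
import torch

# Heavily hedged isolation test for the "FP8 + CUDA graphs" hypothesis.
# Assumptions: a recent PyTorch where the private torch._scaled_mm FP8 GEMM is
# available, and an FP8-capable GPU. Shapes are multiples of 16 as the kernel requires.
M, K, N = 128, 4096, 4096
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # column-major (K, N)
scale = torch.ones((), device="cuda", dtype=torch.float32)        # per-tensor scales

def fp8_step():
    # Private API; adjust the arguments to match the torch version inside the image.
    return torch._scaled_mm(a, b, scale_a=scale, scale_b=scale, out_dtype=torch.bfloat16)

fp8_step()                    # eager warmup: the FP8 kernel runs once outside capture
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):     # if FP8 kernels break under capture, it should fail here
    out = fp8_step()
g.replay()
torch.cuda.synchronize()
print("FP8 GEMM captured and replayed:", tuple(out.shape))
```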
Let me know if any additional info (env, GPU setup, torch version, etc.) is needed.
Thanks in advance!