System Info
When loading the model https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct through OpenShift AI, text-generation-inference isn't able to detect that my cluster has a GPU. GPU information is attached below:
$ nvidia-smi
Sun Apr 27 01:29:31 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM3-32GB           On  | 00000000:BE:00.0 Off |                    0 |
| N/A   34C    P0              51W / 350W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
$
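For reference, the Tesla V100 has CUDA compute capability 7.0 (sm_70). A quick sanity check, assuming python3 and torch are available inside the serving container, shows which architectures the bundled PyTorch build actually ships kernels for:

import torch

# Compute capability of the first visible GPU; a V100 reports (7, 0).
print(torch.cuda.get_device_capability(0))

# CUDA architectures this PyTorch build was compiled for, e.g. ['sm_80', 'sm_86', 'sm_90'].
# If 'sm_70' is missing from this list, the build ships no kernels for a V100.
print(torch.cuda.get_arch_list())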
While trying to load the model, I run into this error:
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2025-04-27T01:39:48.199360Z ERROR warmup{max_input_length=Some(4095) max_prefill_tokens=4096 max_total_tokens=Some(4096) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
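A minimal check, again assuming torch is importable in the same container, can separate "the PyTorch base build has no sm_70 kernels at all" from "only the specialized attention kernels used during warmup are missing":

import torch

# A plain matmul exercises only stock PyTorch CUDA kernels. If even this
# raises "no kernel image is available for execution on the device", the
# base build was not compiled for this GPU; if it succeeds, the failure
# is limited to the custom kernels invoked during warmup.
x = torch.randn(8, 8, device="cuda")
print((x @ x).sum().item())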
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications

Reproduction
Steps to reproduce the behaviour:
Red Hat OpenShift has all the operators required to deploy models, and some TGI models are already running on it. When I check the logs of the pod where this model is supposed to be up and running, below is the full log I see, including the error:
vchalla@vchalla-thinkpadp1gen2:~$ oc logs pod/llama-32-11b-vision-instruct-predictor-00001-deployment-7dt6c9n -f
2025-04-27T01:45:08.141965Z INFO text_generation_launcher: Args {
model_id: "/mnt/models/",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: Some(
4095,
),
max_total_tokens: Some(
4096,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: Some(
4096,
),
max_batch_total_tokens: Some(
8192,
),
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "llama-32-11b-vision-instruct-predictor-00001-deployment-7dt6c9n",
port: 3000,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/tmp/hf_hub_cache",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
2025-04-27T01:45:09.963558Z INFO text_generation_launcher: Disabling prefix caching because of VLM model
2025-04-27T01:45:09.963574Z INFO text_generation_launcher: Forcing attention to 'paged' because head dim is not supported by flashinfer, also disabling prefix caching
2025-04-27T01:45:09.963578Z INFO text_generation_launcher: Using attention paged - Prefix caching 0
2025-04-27T01:45:09.963597Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-04-27T01:45:09.964112Z INFO download: text_generation_launcher: Starting check and download process for /mnt/models/
2025-04-27T01:45:17.403007Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-04-27T01:45:18.285805Z INFO download: text_generation_launcher: Successfully downloaded weights for /mnt/models/
2025-04-27T01:45:18.286146Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-04-27T01:45:25.790482Z INFO text_generation_launcher: Using prefix caching = False
2025-04-27T01:45:25.790529Z INFO text_generation_launcher: Using Attention = paged
2025-04-27T01:45:28.310241Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-04-27T01:45:35.606682Z INFO text_generation_launcher: Using prefill chunking = False
2025-04-27T01:45:36.532700Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-04-27T01:45:36.620877Z INFO shard-manager: text_generation_launcher: Shard ready in 18.323632258s rank=0
2025-04-27T01:45:36.705941Z INFO text_generation_launcher: Starting Webserver
2025-04-27T01:45:36.774287Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-04-27T01:45:36.855219Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-04-27T01:45:45.542531Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 10, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 743, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 198, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
return callback(**use_params)
File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
server.serve(
File "/usr/src/server/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup
self.model.warmup(batch, max_input_tokens, max_total_tokens)
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1577, in warmup
_, _batch, _ = self.generate_token(batch)
File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1963, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/usr/src/server/text_generation_server/models/mllama_causal_lm.py", line 312, in forward
logits, speculative_logits = self.model.forward(
File "/usr/src/server/text_generation_server/models/custom_modeling/mllama.py", line 1021, in forward
outputs = self.text_model(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 679, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 597, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 477, in forward
attn_output = self.self_attn(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 242, in forward
attn_output = attention(
File "/usr/src/server/text_generation_server/layers/attention/cuda.py", line 300, in attention
.reshape(original_shape[0], -1, original_shape[2])
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2025-04-27T01:45:45.662979Z ERROR warmup{max_input_length=Some(4095) max_prefill_tokens=4096 max_total_tokens=Some(4096) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Error: Backend(Warmup(Generation("CUDA error: no kernel image is available for execution on the device\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n")))
2025-04-27T01:45:45.668918Z ERROR text_generation_launcher: Webserver Crashed
2025-04-27T01:45:45.668935Z INFO text_generation_launcher: Shutting down shards
2025-04-27T01:45:45.734131Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2025-04-27T01:45:45.736331Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2025-04-27T01:45:46.137428Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
Expected behavior
Expecting the model to come up and be ready to serve requests.
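Once warmup succeeds, a simple readiness probe against the router (which listens on port 3000 per the Args above) is a GET on TGI's /health endpoint. A sketch, with the hostname below being a placeholder for the predictor's actual route or service:

import requests

# Hypothetical in-cluster URL; substitute your own route or service name.
resp = requests.get("http://llama-32-11b-vision-instruct-predictor:3000/health")
print(resp.status_code)  # 200 once the model is loaded and ready to serve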