Failure to install https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct #3195

vishnuchalla opened this issue Apr 27, 2025 · 0 comments

System Info

When loading the model https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct through OpenShift AI, text-generation-inference is not able to use the GPU in my cluster. Attaching the GPU information below:

$ nvidia-smi
Sun Apr 27 01:29:31 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM3-32GB           On  | 00000000:BE:00.0 Off |                    0 |
| N/A   34C    P0              51W / 350W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
$ 

While trying to load the model, I come across this error:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2025-04-27T01:39:48.199360Z ERROR warmup{max_input_length=Some(4095) max_prefill_tokens=4096 max_total_tokens=Some(4096) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
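
For reference, the Tesla V100 is CUDA compute capability 7.0 (sm_70), and "no kernel image is available for execution on the device" usually indicates that the kernels bundled in the image were not compiled for that architecture. A minimal sketch of a check that can be run against the TGI container, assuming python3 and the bundled PyTorch are on PATH (the pod name is a placeholder):

$ oc exec <tgi-pod-name> -- python3 -c "import torch; print(torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"

On a V100 the first print should show (7, 0); if neither sm_70 nor compute_70 appears in the reported arch list, the bundled PyTorch/CUDA kernels cannot run on this GPU.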

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to reproduce the behaviour:

  1. Red Hat OpenShift has all the operators required to deploy models, and some TGI models are already running on it.
  2. Apply the serving runtime below (the oc commands for applying and checking the manifests are sketched after the reproduction log in step 4).
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: hf-tgi-runtime-experimental
spec:
  containers:
  - args:
    - --model-id=/mnt/models/
    - --port=3000
    - --max-total-tokens=4096
    - --max-input-length=4095
    - --max-batch-total-tokens=8192
    - --max-batch-prefill-tokens=4096
    command:
    - text-generation-launcher
    env:
    - name: HF_HOME
      value: /tmp/hf_home
    - name: HUGGINGFACE_HUB_CACHE
      value: /tmp/hf_hub_cache
    - name: TRANSFORMER_CACHE
      value: /tmp/transformers_cache
    - name: NUMBA_CACHE_DIR
      value: /tmp/numba_cache
    - name: OUTLINES_CACHE_DIR
      value: /tmp/outlines_cache
    - name: TRITON_CACHE_DIR
      value: /tmp/triton_cache
    - name: PYTORCH_CUDA_ALLOC_CONF
      value: expandable_segments:True
    image: ghcr.io/huggingface/text-generation-inference:3.1.0
    livenessProbe:
      exec:
        command:
        - curl
        - localhost:3000/health
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 10
    name: kserve-container
    ports:
    - containerPort: 3000
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - curl
        - localhost:3000/health
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 10
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
  multiModel: true
  supportedModelFormats:
  - autoSelect: true
    name: pytorch
  3. Then apply the inference service below.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.knative.openshift.io/enablePassthrough: "true"
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/rewriteAppHTTPProbers: "true"
  name: llama-32-11b-vision-instruct
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 1
    serviceAccountName: sa
    timeout: 240
    model:
      modelFormat:
        name: pytorch
      runtime: hf-tgi-runtime-experimental
      storageUri: pvc://hackathon-models/llama-3.2-11b-vision-instruct
  4. Now, when I check the logs of the pod where the model is supposed to be up and running, below is the full log I see, including the error:
vchalla@vchalla-thinkpadp1gen2:~$ oc logs pod/llama-32-11b-vision-instruct-predictor-00001-deployment-7dt6c9n -f
2025-04-27T01:45:08.141965Z  INFO text_generation_launcher: Args {
    model_id: "/mnt/models/",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        4095,
    ),
    max_total_tokens: Some(
        4096,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        4096,
    ),
    max_batch_total_tokens: Some(
        8192,
    ),
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "llama-32-11b-vision-instruct-predictor-00001-deployment-7dt6c9n",
    port: 3000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/tmp/hf_hub_cache",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
}
2025-04-27T01:45:09.963558Z  INFO text_generation_launcher: Disabling prefix caching because of VLM model
2025-04-27T01:45:09.963574Z  INFO text_generation_launcher: Forcing attention to 'paged' because head dim is not supported by flashinfer, also disabling prefix caching
2025-04-27T01:45:09.963578Z  INFO text_generation_launcher: Using attention paged - Prefix caching 0
2025-04-27T01:45:09.963597Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-04-27T01:45:09.964112Z  INFO download: text_generation_launcher: Starting check and download process for /mnt/models/
2025-04-27T01:45:17.403007Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-04-27T01:45:18.285805Z  INFO download: text_generation_launcher: Successfully downloaded weights for /mnt/models/
2025-04-27T01:45:18.286146Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-04-27T01:45:25.790482Z  INFO text_generation_launcher: Using prefix caching = False
2025-04-27T01:45:25.790529Z  INFO text_generation_launcher: Using Attention = paged
2025-04-27T01:45:28.310241Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-04-27T01:45:35.606682Z  INFO text_generation_launcher: Using prefill chunking = False
2025-04-27T01:45:36.532700Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-04-27T01:45:36.620877Z  INFO shard-manager: text_generation_launcher: Shard ready in 18.323632258s rank=0
2025-04-27T01:45:36.705941Z  INFO text_generation_launcher: Starting Webserver
2025-04-27T01:45:36.774287Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-04-27T01:45:36.855219Z  INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-04-27T01:45:45.542531Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 10, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 743, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
    return callback(**use_params)
  File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
    server.serve(
  File "/usr/src/server/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup
    self.model.warmup(batch, max_input_tokens, max_total_tokens)
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1577, in warmup
    _, _batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1963, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/usr/src/server/text_generation_server/models/mllama_causal_lm.py", line 312, in forward
    logits, speculative_logits = self.model.forward(
  File "/usr/src/server/text_generation_server/models/custom_modeling/mllama.py", line 1021, in forward
    outputs = self.text_model(
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 679, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 597, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 477, in forward
    attn_output = self.self_attn(
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 242, in forward
    attn_output = attention(
  File "/usr/src/server/text_generation_server/layers/attention/cuda.py", line 300, in attention
    .reshape(original_shape[0], -1, original_shape[2])
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2025-04-27T01:45:45.662979Z ERROR warmup{max_input_length=Some(4095) max_prefill_tokens=4096 max_total_tokens=Some(4096) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Error: Backend(Warmup(Generation("CUDA error: no kernel image is available for execution on the device\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n")))
2025-04-27T01:45:45.668918Z ERROR text_generation_launcher: Webserver Crashed
2025-04-27T01:45:45.668935Z  INFO text_generation_launcher: Shutting down shards
2025-04-27T01:45:45.734131Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2025-04-27T01:45:45.736331Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2025-04-27T01:45:46.137428Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
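
For reference, the two manifests from steps 2 and 3 can be applied and checked with standard oc commands along these lines (the file names are illustrative placeholders, not the actual files used):

$ oc apply -f hf-tgi-runtime-experimental.yaml
$ oc apply -f llama-32-11b-vision-instruct-isvc.yaml
$ oc get inferenceservice llama-32-11b-vision-instruct
$ oc get pods | grep llama-32-11b-vision-instruct-predictor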

Expected behavior

Expecting the model to come up and be ready to serve requests.
