Gemma3: CUDA error: an illegal memory access was encountered #3227

Open · 2 of 4 tasks

sebastianliebscher opened this issue May 14, 2025 · 0 comments

System Info

Text-generation-inference: v3.2.3
Driver Version: 565.57.01 CUDA Version: 12.7
GPU: DGX with 2xH100 80GB

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

env:
  - name: MODEL_ID
    value: "/data/.cache/google/gemma-3-27b-it"
  - name: HF_ENDPOINT
    value: "https://myendpoint/"
  - name: HF_HUB_CACHE
    value: "/data/.cache"
  - name: HOME
    value: "/data"
  - name: CUDA_LAUNCH_BLOCKING
    value: "1"
  - name: MAX_BATCH_PREFILL_TOKENS
    value: "8192"
docker run ghcr.io/huggingface/text-generation-inference:3.2.3
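For reference, the same settings can be passed directly to a standalone docker run of the 3.2.3 image. The following is only a sketch, not the exact command used on the cluster: the /data volume mount, the GPU selection, the shared-memory size, and the explicit --port flag are assumptions; the environment values mirror the spec above, and sharding across the two visible H100s happens automatically (see "Sharding model on 2 processes" in the log below).

# Sketch only: flags noted above as assumptions may differ from the real deployment.
docker run --rm --gpus '"device=0,1"' --shm-size 1g -p 3000:3000 \
  -v /data:/data \
  -e MODEL_ID=/data/.cache/google/gemma-3-27b-it \
  -e HF_ENDPOINT=https://myendpoint/ \
  -e HF_HUB_CACHE=/data/.cache \
  -e HOME=/data \
  -e CUDA_LAUNCH_BLOCKING=1 \
  -e MAX_BATCH_PREFILL_TOKENS=8192 \
  ghcr.io/huggingface/text-generation-inference:3.2.3 \
  --port 3000

The launcher output from the deployment follows: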
2025-05-14T13:15:47.787526Z  INFO text_generation_launcher: Args {
    model_id: "/data/.cache/google/gemma-3-27b-it",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        8192,
    ),
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "gemma3-27b-it-58dfd497c8-mm7nr",
    port: 3000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: Off,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
    graceful_termination_timeout: 90,
}
2025-05-14T13:15:49.317192Z  INFO text_generation_launcher: Disabling prefix caching because of VLM model
2025-05-14T13:15:49.317222Z  INFO text_generation_launcher: Using attention flashinfer - Prefix caching 0
2025-05-14T13:15:49.317246Z  INFO text_generation_launcher: Sharding model on 2 processes
2025-05-14T13:15:49.317249Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-05-14T13:15:49.317468Z  INFO download: text_generation_launcher: Starting check and download process for /data/.cache/google/gemma-3-27b-it
2025-05-14T13:15:53.810527Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-05-14T13:15:55.038541Z  INFO download: text_generation_launcher: Successfully downloaded weights for /data/.cache/google/gemma-3-27b-it
2025-05-14T13:15:55.038833Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-05-14T13:15:55.266191Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2025-05-14T13:15:59.786503Z  INFO text_generation_launcher: Using prefix caching = False
2025-05-14T13:15:59.786546Z  INFO text_generation_launcher: Using Attention = flashinfer
2025-05-14T13:16:05.143022Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-05-14T13:16:05.335571Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-05-14T13:16:15.153347Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-05-14T13:16:15.346552Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-05-14T13:16:25.163428Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-05-14T13:16:25.357660Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-05-14T13:16:27.541227Z  INFO text_generation_launcher: Using prefill chunking = False
2025-05-14T13:16:28.647809Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-05-14T13:16:28.666728Z  INFO shard-manager: text_generation_launcher: Shard ready in 33.619558779s rank=0
2025-05-14T13:16:29.009265Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2025-05-14T13:16:29.061719Z  INFO shard-manager: text_generation_launcher: Shard ready in 33.782448874s rank=1
2025-05-14T13:16:29.084445Z  INFO text_generation_launcher: Starting Webserver
2025-05-14T13:16:29.134771Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-05-14T13:16:29.228829Z  INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-05-14T13:16:32.603128Z  INFO text_generation_launcher: KV-cache blocks: 142241, size: 1
2025-05-14T13:16:32.740463Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2025-05-14T13:16:35.692327Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 142241
2025-05-14T13:16:35.692413Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2025-05-14T13:16:35.692422Z  INFO text_generation_router: backends/v3/src/main.rs:162: Maximum input tokens defaulted to 8191
2025-05-14T13:16:35.692427Z  INFO text_generation_router: backends/v3/src/main.rs:168: Maximum total tokens defaulted to 8192
2025-05-14T13:16:35.694350Z  WARN text_generation_router::server: router/src/server.rs:1648: Tokenizer_config None - Some("/data/.cache/google/gemma-3-27b-it/tokenizer_config.json")
2025-05-14T13:16:35.700262Z  INFO text_generation_router::server: router/src/server.rs:1661: Using chat template from chat_template.json
2025-05-14T13:16:41.042714Z  INFO text_generation_router::server: router/src/server.rs:1716: Using config Some(Gemma3(Gemma3 { vision_config: Gemma3VisionConfig { image_size: 896, patch_size: 14 } }))
2025-05-14T13:16:41.042829Z WARN text_generation_router::server: router/src/server.rs:1776: no pipeline tag found for model /data/.cache/google/gemma-3-27b-it
2025-05-14T13:16:41.042850Z WARN text_generation_router::server: router/src/server.rs:1879: Invalid hostname, defaulting to 0.0.0.0
2025-05-14T13:16:41.217297Z  INFO text_generation_router::server: router/src/server.rs:2266: Connected
2025-05-14T13:19:09.004389Z  INFO chat_completions{parameters="GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(15), return_full_text: None, stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: Some(\"google/gemma-3-27b-it\") }" total_time="667.249736ms" validation_time="972.818µs" queue_time="175.156186ms" inference_time="491.120838ms" time_per_token="98.224167ms" seed="Some(6251085832493682637)"}: text_generation_router::server: router/src/server.rs:624: Success
2025-05-14T13:33:04.677634Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/usr/src/.venv/bin//text-generation-server", line 10, in <module>
    sys.exit(app())
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main
    return _main(
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
    return callback(**use_params)
  File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
    server.serve(
  File "/usr/src/server/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/usr/src/server/text_generation_server/server.py", line 183, in Prefill
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1928, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 519, in forward
    logits, speculative_logits = self.model.forward(
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 867, in forward
    hidden_states = self.text_model.model(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 520, in forward
    hidden_states, residual = layer(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 440, in forward
    attn_output = self.self_attn(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 253, in forward
    attn_output = attention(
  File "/usr/src/server/text_generation_server/layers/attention/cuda.py", line 252, in attention
    return prefill_with_paged_kv_state.get().forward(
  File "/usr/src/.venv/lib/python3.11/site-packages/flashinfer/prefill.py", line 1484, in forward
    return self.run(q, paged_kv_cache, k_scale=k_scale, v_scale=v_scale)
  File "/usr/src/.venv/lib/python3.11/site-packages/flashinfer/prefill.py", line 1644, in run
    self._cached_module.paged_run(*run_args)
  File "/usr/src/.venv/lib/python3.11/site-packages/flashinfer/prefill.py", line 371, in paged_run
    paged_run_func(
RuntimeError: BatchPrefillWithPagedKVCacheSM90Run failed with error: an illegal memory access was encountered
2025-05-14T13:33:04.678349Z ERROR batch{batch_size=2}:prefill:prefill{id=210 size=2}:prefill{id=210 size=2}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Unexpected <class 'RuntimeError'>: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

2025-05-14T13:33:07.323494Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2025-05-14 13:15:56.913 | INFO     | text_generation_server.utils.import_utils:<module>:76 - Detected system cuda
/usr/src/server/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py:302: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  lengths_tensor = torch.tensor(
CUDA Error: an illegal memory access was encountered (an illegal memory access was encountered) /tmp/pip-install-98__jjp7/flashinfer-python_a458607c05a541acaad744d95c5d211e/include/flashinfer/attention/hopper/prefill_sm90.cuh: line 356 at function cudaLaunchKernel(kernel, grid_dims, block_dims, args, smem_size, stream)
[rank0]:[W514 13:33:05.776856880 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) rank=0
2025-05-14T13:33:07.375299Z ERROR text_generation_launcher: Shard 0 crashed
2025-05-14T13:33:07.375338Z  INFO text_generation_launcher: Terminating webserver
2025-05-14T13:33:07.375367Z  INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2025-05-14T13:33:07.375505Z  INFO text_generation_router::server: router/src/server.rs:2363: signal received, starting graceful shutdown
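
The payload that triggered the failing prefill is not captured in the log above. For context, the successful chat_completions entry at 13:19:09 corresponds to a request against the OpenAI-compatible /v1/chat/completions endpoint; since Gemma 3 is served as a VLM here, a multimodal request of roughly the following shape exercises the same prefill path. This is a hedged sketch: the image URL, prompt, and host are placeholders, not the actual payload that crashed the server.

# Sketch only: placeholder prompt/image, not the real request.
curl -s http://localhost:3000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "google/gemma-3-27b-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/placeholder.png"}},
          {"type": "text", "text": "Describe this image."}
        ]
      }
    ],
    "max_tokens": 15
  }'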

Expected behavior

The expected result is a successful response, as in the earlier request:

INFO chat_completions{parameters="GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(15), return_full_text: None, stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: Some(\"google/gemma-3-27b-it\") }" total_time="667.249736ms" validation_time="972.818µs" queue_time="175.156186ms" inference_time="491.120838ms" time_per_token="98.224167ms" seed="Some(6251085832493682637)"}: text_generation_router::server: router/src/server.rs:624: Success