Gemma3: CUDA error: an illegal memory access was encountered #3227

Open · 2 of 4 tasks

sebastianliebscher opened this issue May 14, 2025 · 0 comments

System Info

Text-generation-inference: v3.2.3
Driver Version: 565.57.01 CUDA Version: 12.7
GPU: DGX with 2xH100 80GB

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

env:
  - name: MODEL_ID
    value: "/data/.cache/google/gemma-3-27b-it"
  - name: HF_ENDPOINT
    value: "https://myendpoint/"
  - name: HF_HUB_CACHE
    value: "/data/.cache"
  - name: HOME
    value: "/data"
  - name: CUDA_LAUNCH_BLOCKING
    value: "1"
  - name: MAX_BATCH_PREFILL_TOKENS
    value: "8192"
docker run ghcr.io/huggingface/text-generation-inference:3.2.3
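For reference, the same settings can be passed directly to a standalone docker run of the 3.2.3 image. The following is only a sketch, not the exact command used on the cluster: the /data volume mount, the GPU selection, the shared-memory size, and the explicit --port flag are assumptions; the environment values mirror the spec above, and sharding across the two visible H100s happens automatically (see "Sharding model on 2 processes" in the log below).

# Sketch only: flags noted above as assumptions may differ from the real deployment.
docker run --rm --gpus '"device=0,1"' --shm-size 1g -p 3000:3000 \
  -v /data:/data \
  -e MODEL_ID=/data/.cache/google/gemma-3-27b-it \
  -e HF_ENDPOINT=https://myendpoint/ \
  -e HF_HUB_CACHE=/data/.cache \
  -e HOME=/data \
  -e CUDA_LAUNCH_BLOCKING=1 \
  -e MAX_BATCH_PREFILL_TOKENS=8192 \
  ghcr.io/huggingface/text-generation-inference:3.2.3 \
  --port 3000

The launcher output from the deployment follows: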
2025-05-14T13:15:47.787526Z  INFO text_generation_launcher: Args {
    model_id: "/data/.cache/google/gemma-3-27b-it",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        8192,
    ),
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "gemma3-27b-it-58dfd497c8-mm7nr",
    port: 3000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: Off,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
    graceful_termination_timeout: 90,
}
2025-05-14T13:15:49.317192Z  INFO text_generation_launcher: Disabling prefix caching because of VLM model
2025-05-14T13:15:49.317222Z  INFO text_generation_launcher: Using attention flashinfer - Prefix caching 0
2025-05-14T13:15:49.317246Z  INFO text_generation_launcher: Sharding model on 2 processes
2025-05-14T13:15:49.317249Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-05-14T13:15:49.317468Z  INFO download: text_generation_launcher: Starting check and download process for /data/.cache/google/gemma-3-27b-it
2025-05-14T13:15:53.810527Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-05-14T13:15:55.038541Z  INFO download: text_generation_launcher: Successfully downloaded weights for /data/.cache/google/gemma-3-27b-it
2025-05-14T13:15:55.038833Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-05-14T13:15:55.266191Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2025-05-14T13:15:59.786503Z  INFO text_generation_launcher: Using prefix caching = False
2025-05-14T13:15:59.786546Z  INFO text_generation_launcher: Using Attention = flashinfer
2025-05-14T13:16:05.143022Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-05-14T13:16:05.335571Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-05-14T13:16:15.153347Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-05-14T13:16:15.346552Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-05-14T13:16:25.163428Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-05-14T13:16:25.357660Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-05-14T13:16:27.541227Z  INFO text_generation_launcher: Using prefill chunking = False
2025-05-14T13:16:28.647809Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-05-14T13:16:28.666728Z  INFO shard-manager: text_generation_launcher: Shard ready in 33.619558779s rank=0
2025-05-14T13:16:29.009265Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2025-05-14T13:16:29.061719Z  INFO shard-manager: text_generation_launcher: Shard ready in 33.782448874s rank=1
2025-05-14T13:16:29.084445Z  INFO text_generation_launcher: Starting Webserver
2025-05-14T13:16:29.134771Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-05-14T13:16:29.228829Z  INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-05-14T13:16:32.603128Z  INFO text_generation_launcher: KV-cache blocks: 142241, size: 1
2025-05-14T13:16:32.740463Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2025-05-14T13:16:35.692327Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 142241
2025-05-14T13:16:35.692413Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2025-05-14T13:16:35.692422Z  INFO text_generation_router: backends/v3/src/main.rs:162: Maximum input tokens defaulted to 8191
2025-05-14T13:16:35.692427Z  INFO text_generation_router: backends/v3/src/main.rs:168: Maximum total tokens defaulted to 8192
2025-05-14T13:16:35.694350Z  WARN text_generation_router::server: router/src/server.rs:1648: Tokenizer_config None - Some("/data/.cache/google/gemma-3-27b-it/tokenizer_config.json")
2025-05-14T13:16:35.700262Z  INFO text_generation_router::server: router/src/server.rs:1661: Using chat template from chat_template.json
2025-05-14T13:16:41.042714Z  INFO text_generation_router::server: router/src/server.rs:1716: Using config Some(Gemma3(Gemma3 { vision_config: Gemma3VisionConfig { image_size: 896, patch_size: 14 } }))
2025-05-14T13:16:41.042829Z WARN text_generation_router::server: router/src/server.rs:1776: no pipeline tag found for model /data/.cache/google/gemma-3-27b-it
2025-05-14T13:16:41.042850Z WARN text_generation_router::server: router/src/server.rs:1879: Invalid hostname, defaulting to 0.0.0.0
2025-05-14T13:16:41.217297Z  INFO text_generation_router::server: router/src/server.rs:2266: Connected
2025-05-14T13:19:09.004389Z  INFO chat_completions{parameters="GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(15), return_full_text: None, stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: Some(\"google/gemma-3-27b-it\") }" total_time="667.249736ms" validation_time="972.818µs" queue_time="175.156186ms" inference_time="491.120838ms" time_per_token="98.224167ms" seed="Some(6251085832493682637)"}: text_generation_router::server: router/src/server.rs:624: Success
2025-05-14T13:33:04.677634Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/usr/src/.venv/bin//text-generation-server", line 10, in <module>
    sys.exit(app())
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main
    return _main(
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
    return callback(**use_params)
  File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
    server.serve(
  File "/usr/src/server/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/usr/src/server/text_generation_server/server.py", line 183, in Prefill
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1928, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 519, in forward
    logits, speculative_logits = self.model.forward(
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 867, in forward
    hidden_states = self.text_model.model(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 520, in forward
    hidden_states, residual = layer(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 440, in forward
    attn_output = self.self_attn(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 253, in forward
    attn_output = attention(
  File "/usr/src/server/text_generation_server/layers/attention/cuda.py", line 252, in attention
    return prefill_with_paged_kv_state.get().forward(
  File "/usr/src/.venv/lib/python3.11/site-packages/flashinfer/prefill.py", line 1484, in forward
    return self.run(q, paged_kv_cache, k_scale=k_scale, v_scale=v_scale)
  File "/usr/src/.venv/lib/python3.11/site-packages/flashinfer/prefill.py", line 1644, in run
    self._cached_module.paged_run(*run_args)
  File "/usr/src/.venv/lib/python3.11/site-packages/flashinfer/prefill.py", line 371, in paged_run
    paged_run_func(
RuntimeError: BatchPrefillWithPagedKVCacheSM90Run failed with error: an illegal memory access was encountered
2025-05-14T13:33:04.678349Z ERROR batch{batch_size=2}:prefill:prefill{id=210 size=2}:prefill{id=210 size=2}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Unexpected <class 'RuntimeError'>: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

2025-05-14T13:33:07.323494Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2025-05-14 13:15:56.913 | INFO     | text_generation_server.utils.import_utils:<module>:76 - Detected system cuda
/usr/src/server/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py:302: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  lengths_tensor = torch.tensor(
CUDA Error: an illegal memory access was encountered (an illegal memory access was encountered) /tmp/pip-install-98__jjp7/flashinfer-python_a458607c05a541acaad744d95c5d211e/include/flashinfer/attention/hopper/prefill_sm90.cuh: line 356 at function cudaLaunchKernel(kernel, grid_dims, block_dims, args, smem_size, stream)
[rank0]:[W514 13:33:05.776856880 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) rank=0
2025-05-14T13:33:07.375299Z ERROR text_generation_launcher: Shard 0 crashed
2025-05-14T13:33:07.375338Z  INFO text_generation_launcher: Terminating webserver
2025-05-14T13:33:07.375367Z  INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2025-05-14T13:33:07.375505Z  INFO text_generation_router::server: router/src/server.rs:2363: signal received, starting graceful shutdown
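
The payload that triggered the failing prefill is not captured in the log above. For context, the successful chat_completions entry at 13:19:09 corresponds to a request against the OpenAI-compatible /v1/chat/completions endpoint; since Gemma 3 is served as a VLM here, a multimodal request of roughly the following shape exercises the same prefill path. This is a hedged sketch: the image URL, prompt, and host are placeholders, not the actual payload that crashed the server.

# Sketch only: placeholder prompt/image, not the real request.
curl -s http://localhost:3000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "google/gemma-3-27b-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/placeholder.png"}},
          {"type": "text", "text": "Describe this image."}
        ]
      }
    ],
    "max_tokens": 15
  }'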

Expected behavior

The expected result is a successful response, as in the earlier request:

INFO chat_completions{parameters="GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(15), return_full_text: None, stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: Some(\"google/gemma-3-27b-it\") }" total_time="667.249736ms" validation_time="972.818µs" queue_time="175.156186ms" inference_time="491.120838ms" time_per_token="98.224167ms" seed="Some(6251085832493682637)"}: text_generation_router::server: router/src/server.rs:624: Success