
Tokenizer loading fails for mistralai/Ministral-8B-Instruct-2410 using TGI on GCP Vertex AI #3163


pavlonator opened this issue Apr 10, 2025 · 1 comment
pavlonator commented Apr 10, 2025

System Info

❗ Summary

Deployment of the mistralai/Ministral-8B-Instruct-2410 model using Text Generation Inference (TGI) via Google Cloud Vertex AI fails during model shard initialization due to a tokenizer loading error.

📍 Environment

Platform: Google Cloud Vertex AI (Endpoints)

Container: ghcr.io/huggingface/text-generation-inference:1.4
Model ID: mistralai/Ministral-8B-Instruct-2410
GPU: NVIDIA T4 (compute capability 7.5); the same failure occurs on A100 and L4
HF Token: Verified & working
Model download: Succeeds (all .safetensors files retrieved)

✅ Steps to Reproduce

Configure the gcloud client and create a project in gcloud.

Create the Docker Compose file:

version: "3.9"

services:
  mistral8b:
    container_name: mistral8b
    image: ghcr.io/huggingface/text-generation-inference:1.4
    ports:
      - "8080:80"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]      
    environment:
      - MODEL_ID=mistralai/Ministral-8B-Instruct-2410
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
      - DEVICE=cuda
      - DISABLE_CUSTOM_KERNELS=false
      - MAX_INPUT_LENGTH=4096
      - MAX_TOTAL_TOKENS=8192
      - MAX_BATCH_PREFILL_TOKENS=8192
      - MAX_CONCURRENT_REQUESTS=2

Configure Docker authentication for Artifact Registry:

gcloud auth configure-docker us-central1-docker.pkg.dev

Tag the Docker image:

docker tag ghcr.io/huggingface/text-generation-inference:1.4 \
  us-central1-docker.pkg.dev/YOUR_PROJECT_ID/mistral-container-repo/mistral8b-inference:latest

Push the Docker image:

docker push us-central1-docker.pkg.dev/YOUR_PROJECT_ID/mistral-container-repo/mistral8b-inference:latest

Upload the model to Vertex AI:

gcloud ai models upload \
  --region=us-central1 \
  --display-name="mistral-7b-instruct" \
  --container-image-uri=us-central1-docker.pkg.dev/learnmistral/mistral-container-repo/mistral8b-inference:latest \
  --container-health-route=/health \
  --container-predict-route=/generate \
  --container-env-vars="MODEL_ID=mistralai/Mistral-8B-Instruct-2410,HUGGING_FACE_HUB_TOKEN=your_hf_token_here,DEVICE=cuda,MAX_TOTAL_TOKENS=512,MAX_INPUT_LENGTH=256"

Find your numeric Model ID by listing the models:

gcloud ai models list --region=us-central1

Record your Model ID; you will need it in the next steps.

Create a Vertex AI endpoint:

gcloud ai endpoints create \
  --region=us-central1 \
  --display-name="mistral-8b-endpoint"

Response:

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
Waiting for operation [****************]...done.                             

Created Vertex AI endpoint: projects/********/locations/us-central1/endpoints/*********************.

Note the endpoint ID after /endpoints/; you will need it in the next steps.

Deploy the model to the Vertex AI endpoint:

gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name="mistral-nvidia-l4" \
  --traffic-split=0=100 \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1

Or, for a T4:

gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name="mistral-t4" \
  --traffic-split=0=100 \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1

Or, for an A100:

gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name="mistral-a100" \
  --traffic-split=0=100 \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1

Replace ENDPOINT_ID and MODEL_ID with the real numeric values from the previous steps.

After a while, the command prints a link to the logs, where you can see the full details.
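
A hedged sketch (not required for the repro) for pulling those endpoint logs programmatically with the google-cloud-logging client; the resource filter below is an assumption about how Vertex AI labels endpoint containers, and YOUR_PROJECT_ID is a placeholder:

# Sketch: read recent Vertex AI endpoint log entries instead of following the link.
# Assumes google-cloud-logging is installed; the resource.type filter is an assumption.
from google.cloud import logging

client = logging.Client(project="YOUR_PROJECT_ID")  # placeholder project ID
log_filter = 'resource.type="aiplatform.googleapis.com/Endpoint" severity>=WARNING'

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING, max_results=50):
    print(entry.timestamp, entry.payload)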

❌ Error Observed (see the attached log file for details)

TGI fails on model initialization with:

Exception: data did not match any variant of untagged enum ModelWrapper at line 1217944 column 3

The full stack trace includes:

  • tokenization_llama_fast.py
  • TokenizerFast.from_file(...)
  • AutoTokenizer.from_pretrained(...)

See the full log snapshot here:

Exception: data did not match any variant of untagged enum ModelWrapper at line 1217944 column 3

(Available in full in attached log)
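
Since serde reports a line and column into tokenizer.json, a minimal sketch (assuming the cached file matches the current revision on the Hub) to see which construct it rejects is to print the window around that line:

# Sketch: download tokenizer.json and print the lines around the reported error location.
# The token value is a placeholder, as in the commands above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "mistralai/Ministral-8B-Instruct-2410",
    "tokenizer.json",
    token="your_hf_token_here",  # placeholder
)

target = 1217944  # line number from the ModelWrapper error
with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        if target - 5 <= i <= target + 5:
            print(i, line.rstrip())
        elif i > target + 5:
            break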

📂 Model Download Logs (Success)
/data/models--mistralai--Ministral-8B-Instruct-2410/snapshots/.../model-0000X-of-00004.safetensors

consolidated.safetensors downloaded successfully

Logs confirm: "Successfully downloaded weights."

🧪 Additional Verifications
✅ AutoTokenizer.from_pretrained(...) works locally on CPU

✅ huggingface-cli login and transformers-cli env confirmed

⚠️ TGI tokenizer loading fails only inside container/GCP

🤔 Hypothesis

The tokenizer JSON or fast tokenizer files (likely tokenizer.json) contain an unexpected enum tag or structure not properly parsed by tokenizers or TGI.

Possible schema incompatibility between tokenizer JSON and Rust-based tokenizers used inside TGI.

Ministral-8B-Instruct-2410 tokenizer may include experimental or malformed constructs not validated in TGI’s inference pipeline.

The tokenizer file is very large (the error points at line 1217944) and might be malformed or partially truncated during sync/caching in GCP.
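
One way to test this hypothesis outside of TGI and GCP is to feed tokenizer.json to the Rust-backed tokenizers library directly; a minimal sketch, assuming huggingface_hub and a recent tokenizers release are installed (token placeholder as above):

# Sketch: if a current tokenizers release deserializes the file but the version bundled
# in the 1.4 image does not, the problem is a version/schema gap rather than a broken file.
import tokenizers
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

path = hf_hub_download(
    "mistralai/Ministral-8B-Instruct-2410",
    "tokenizer.json",
    token="your_hf_token_here",  # placeholder
)

print("tokenizers version:", tokenizers.__version__)
try:
    tok = Tokenizer.from_file(path)
    print("Rust deserializer OK, vocab size:", tok.get_vocab_size())
except Exception as e:
    print("Rust deserializer failed:", e)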

📌 Suggested Cross-References

MistralAI: mistralai/Ministral-8B-Instruct-2410

TGI GitHub Repository

Consider syncing with the Hugging Face and Mistral teams to validate the tokenizer.json file against the expected schema.

🧷 Suggested Fix Paths

Validate and lint tokenizer config (tokenizer.json) for malformed entries.
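
A minimal lint sketch for this suggestion (plain JSON checks only, using the same placeholder token; the variant names mirror the ModelWrapper list cited under Expected behavior):

# Sketch: json.load fails if the file is truncated or invalid; then check that the
# tokenizer's model type is one of the variants the ModelWrapper enum knows about.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "mistralai/Ministral-8B-Instruct-2410",
    "tokenizer.json",
    token="your_hf_token_here",  # placeholder
)

with open(path, encoding="utf-8") as f:
    data = json.load(f)

model_type = data.get("model", {}).get("type")
known_variants = {"BPE", "Unigram", "WordPiece", "WordLevel"}
print("model.type:", model_type, "| known variant:", model_type in known_variants)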

Test deployment with a smaller/older version like Mistral-7B-Instruct-v0.2 (this worked).

Provide a fallback or validation inside AutoTokenizer.from_pretrained() to raise a meaningful error.

📎 Log File

I’ve attached the full TGI shard log from GCP as evidence:

downloaded-logs-20250409-170852.json

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

The steps are identical to ✅ Steps to Reproduce under System Info above.

Expected behavior

✅ Expected Behavior

  • The model mistralai/Ministral-8B-Instruct-2410 should load successfully in TGI when provided with:

    • A valid Hugging Face token
    • Correct MODEL_ID set in environment
    • Sufficient GPU resources (e.g., NVIDIA T4, A100)
  • TGI should load the tokenizer from the model repository using the files provided:

    • tokenizer_config.json
    • tokenizer.json (the fast tokenizer graph)
    • special_tokens_map.json
    • tokenizer.model (if used for SentencePiece or BPE)
  • This should match the behavior of the following Python code, which works correctly:

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        "mistralai/Ministral-8B-Instruct-2410",
        trust_remote_code=True
    )
  • TGI should deserialize the tokenizer using the Hugging Face tokenizers Rust backend.

  • If the tokenizer format includes fields outside the expected schema, TGI should either:

    • Gracefully handle the mismatch (as transformers does), or
    • Emit a clearer error message and suggest workarounds (a fallback sketch follows this list)
  • The Rust-based deserializer in TGI currently fails with:

    Exception: data did not match any variant of untagged enum ModelWrapper at line 1217944 column 3
    
  • This likely originates from the tokenizers Rust crate where .json deserialization is handled via serde. The ModelWrapper enum expects well-defined variants such as BPE, WordPiece, Unigram, or WordLevel.

  • Python-based transformers uses flexible class instantiation with fallback logic and can tolerate schema drift or missing fields, which TGI cannot.

  • Since the model is public, loads fine in transformers, and includes all expected files, the user expects it to work in text-generation-inference as well — especially since TGI is positioned as a standard HF-hosted inference backend.
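
For illustration only (this is not TGI's actual code), a sketch of the kind of fallback requested above: try the Rust-backed fast tokenizer first and, if deserialization fails, emit a clearer hint and fall back to the Python path. Whether a slow tokenizer is available for this model is an assumption.

# Sketch of a graceful-fallback loader; names and messages are illustrative, not TGI APIs.
from transformers import AutoTokenizer

def load_tokenizer_with_fallback(model_id: str, token: str):
    try:
        # Fast path: Rust-backed tokenizer, the same deserializer TGI relies on.
        return AutoTokenizer.from_pretrained(model_id, token=token, use_fast=True)
    except Exception as e:
        print(
            f"Fast (Rust) tokenizer failed for {model_id}: {e}\n"
            "Hint: tokenizer.json may use a newer schema than the installed "
            "tokenizers library; try upgrading tokenizers or the TGI image."
        )
        # Fallback: Python (slow) tokenizer, assuming one exists for this model.
        return AutoTokenizer.from_pretrained(model_id, token=token, use_fast=False)

tokenizer = load_tokenizer_with_fallback(
    "mistralai/Ministral-8B-Instruct-2410", token="your_hf_token_here"
)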

@pavlonator pavlonator changed the title Tokenizer loading fails for mistralai/Ministral-8B-Instruct-2410 using TGI on GCP Vertex AI Tokenizer loading fails for mistralai/Mistral-8B-Instruct-2410 using TGI on GCP Vertex AI Apr 10, 2025
@pavlonator pavlonator changed the title Tokenizer loading fails for mistralai/Mistral-8B-Instruct-2410 using TGI on GCP Vertex AI Tokenizer loading fails for mistralai/Ministral-8B-Instruct-2410 using TGI on GCP Vertex AI Apr 10, 2025
pavlonator (Author) commented:

Is there anybody who can take a look?
