
Tokenizer loading fails for mistralai/Ministral-8B-Instruct-2410 using TGI on GCP Vertex AI #3163


pavlonator opened this issue Apr 10, 2025 · 1 comment
pavlonator commented Apr 10, 2025

System Info

❗ Summary

Deployment of the mistralai/Ministral-8B-Instruct-2410 model using Text Generation Inference (TGI) via Google Cloud Vertex AI fails during model shard initialization due to a tokenizer loading error.

📍 Environment

Platform: Google Cloud Vertex AI (Endpoints)

Container: ghcr.io/huggingface/text-generation-inference:1.4
Model ID: mistralai/Ministral-8B-Instruct-2410
GPU: NVIDIA T4 (compute capability 7.5); the same failure occurs on A100 and L4
HF Token: Verified & working
Model download: Succeeds (all .safetensors files retrieved)

✅ Steps to Reproduce

Configure the gcloud client and create a project in gcloud.

Create the Docker Compose file:

version: "3.9"

services:
  mistral8b:
    container_name: mistral8b
    image: ghcr.io/huggingface/text-generation-inference:1.4
    ports:
      - "8080:80"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]      
    environment:
      - MODEL_ID=mistralai/Ministral-8B-Instruct-2410
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
      - DEVICE=cuda
      - DISABLE_CUSTOM_KERNELS=false
      - MAX_INPUT_LENGTH=4096
      - MAX_TOTAL_TOKENS=8192
      - MAX_BATCH_PREFILL_TOKENS=8192
      - MAX_CONCURRENT_REQUESTS=2

Configure Docker authentication for Artifact Registry:

gcloud auth configure-docker us-central1-docker.pkg.dev

Tag the Docker image:

docker tag ghcr.io/huggingface/text-generation-inference:1.4 \
  us-central1-docker.pkg.dev/YOUR_PROJECT_ID/mistral-container-repo/mistral8b-inference:latest

Push the Docker image:

docker push us-central1-docker.pkg.dev/YOUR_PROJECT_ID/mistral-container-repo/mistral8b-inference:latest

Upload the model to Vertex AI:

gcloud ai models upload \
  --region=us-central1 \
  --display-name="mistral-7b-instruct" \
  --container-image-uri=us-central1-docker.pkg.dev/learnmistral/mistral-container-repo/mistral8b-inference:latest \
  --container-health-route=/health \
  --container-predict-route=/generate \
  --container-env-vars="MODEL_ID=mistralai/Mistral-8B-Instruct-2410,HUGGING_FACE_HUB_TOKEN=your_hf_token_here,DEVICE=cuda,MAX_TOTAL_TOKENS=512,MAX_INPUT_LENGTH=256"

Find your numeric Model ID by listing the models:

gcloud ai models list --region=us-central1

Record your Model ID; you will need it in the next steps.

Create a Vertex AI endpoint:

gcloud ai endpoints create \
  --region=us-central1 \
  --display-name="mistral-8b-endpoint"

Response:

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
Waiting for operation [****************]...done.                             

Created Vertex AI endpoint: projects/********/locations/us-central1/endpoints/*********************.

Note the endpoint ID after /endpoints/; you will need it in the next steps.

Deploy the model to the Vertex AI endpoint:

gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name="mistral-nvidia-l4" \
  --traffic-split=0=100 \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1

Or, for a T4:

gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name="mistral-t4" \
  --traffic-split=0=100 \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1

Or, for an A100:

gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name="mistral-a100" \
  --traffic-split=0=100 \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1

Replace ENDPOINT_ID and MODEL_ID with the real numeric values from the previous steps.

After a while, the command prints a link to the logs, where you can see the full details.
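
A hedged sketch (not required for the repro) for pulling those endpoint logs programmatically with the google-cloud-logging client; the resource filter below is an assumption about how Vertex AI labels endpoint containers, and YOUR_PROJECT_ID is a placeholder:

# Sketch: read recent Vertex AI endpoint log entries instead of following the link.
# Assumes google-cloud-logging is installed; the resource.type filter is an assumption.
from google.cloud import logging

client = logging.Client(project="YOUR_PROJECT_ID")  # placeholder project ID
log_filter = 'resource.type="aiplatform.googleapis.com/Endpoint" severity>=WARNING'

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING, max_results=50):
    print(entry.timestamp, entry.payload)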

❌ Error Observed (see the attached log file for details)

TGI fails on model initialization with:

Exception: data did not match any variant of untagged enum ModelWrapper at line 1217944 column 3

The full stack trace includes:

  • tokenization_llama_fast.py
  • TokenizerFast.from_file(...)
  • AutoTokenizer.from_pretrained(...)

See the full log snapshot here:

Exception: data did not match any variant of untagged enum ModelWrapper at line 1217944 column 3

(Available in full in attached log)
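
Since serde reports a line and column into tokenizer.json, a minimal sketch (assuming the cached file matches the current revision on the Hub) to see which construct it rejects is to print the window around that line:

# Sketch: download tokenizer.json and print the lines around the reported error location.
# The token value is a placeholder, as in the commands above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "mistralai/Ministral-8B-Instruct-2410",
    "tokenizer.json",
    token="your_hf_token_here",  # placeholder
)

target = 1217944  # line number from the ModelWrapper error
with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        if target - 5 <= i <= target + 5:
            print(i, line.rstrip())
        elif i > target + 5:
            break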

📂 Model Download Logs (Success)
/data/models--mistralai--Ministral-8B-Instruct-2410/snapshots/.../model-0000X-of-00004.safetensors

consolidated.safetensors downloaded successfully

Logs confirm: "Successfully downloaded weights."

🧪 Additional Verifications
✅ AutoTokenizer.from_pretrained(...) works locally on CPU

✅ huggingface-cli login and transformers-cli env confirmed

⚠️ TGI tokenizer loading fails only inside container/GCP

🤔 Hypothesis

The tokenizer JSON or fast tokenizer files (likely tokenizer.json) contain an unexpected enum tag or structure not properly parsed by tokenizers or TGI.

Possible schema incompatibility between tokenizer JSON and Rust-based tokenizers used inside TGI.

Ministral-8B-Instruct-2410 tokenizer may include experimental or malformed constructs not validated in TGI’s inference pipeline.

The tokenizer file is very large (the error points at line 1217944) and might be malformed or partially truncated during sync/caching in GCP.
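
One way to test this hypothesis outside of TGI and GCP is to feed tokenizer.json to the Rust-backed tokenizers library directly; a minimal sketch, assuming huggingface_hub and a recent tokenizers release are installed (token placeholder as above):

# Sketch: if a current tokenizers release deserializes the file but the version bundled
# in the 1.4 image does not, the problem is a version/schema gap rather than a broken file.
import tokenizers
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

path = hf_hub_download(
    "mistralai/Ministral-8B-Instruct-2410",
    "tokenizer.json",
    token="your_hf_token_here",  # placeholder
)

print("tokenizers version:", tokenizers.__version__)
try:
    tok = Tokenizer.from_file(path)
    print("Rust deserializer OK, vocab size:", tok.get_vocab_size())
except Exception as e:
    print("Rust deserializer failed:", e)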

📌 Suggested Cross-References

MistralAI: mistralai/Ministral-8B-Instruct-2410

TGI GitHub Repository

Consider syncing with the Hugging Face and Mistral teams to validate the tokenizer.json file against the expected schema.

🧷 Suggested Fix Paths

Validate and lint tokenizer config (tokenizer.json) for malformed entries.
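
A minimal lint sketch for this suggestion (plain JSON checks only, using the same placeholder token; the variant names mirror the ModelWrapper list cited under Expected behavior):

# Sketch: json.load fails if the file is truncated or invalid; then check that the
# tokenizer's model type is one of the variants the ModelWrapper enum knows about.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "mistralai/Ministral-8B-Instruct-2410",
    "tokenizer.json",
    token="your_hf_token_here",  # placeholder
)

with open(path, encoding="utf-8") as f:
    data = json.load(f)

model_type = data.get("model", {}).get("type")
known_variants = {"BPE", "Unigram", "WordPiece", "WordLevel"}
print("model.type:", model_type, "| known variant:", model_type in known_variants)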

Test deployment with a smaller/older version like Mistral-7B-Instruct-v0.2 (this worked).

Provide a fallback or validation inside AutoTokenizer.from_pretrained() to raise a meaningful error.

📎 Log File

I’ve attached the full TGI shard log from GCP as evidence:

downloaded-logs-20250409-170852.json

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

The steps are identical to ✅ Steps to Reproduce under System Info above.

Expected behavior

✅ Expected Behavior

  • The model mistralai/Ministral-8B-Instruct-2410 should load successfully in TGI when provided with:

    • A valid Hugging Face token
    • Correct MODEL_ID set in environment
    • Sufficient GPU resources (e.g., NVIDIA T4, A100)
  • TGI should load the tokenizer from the model repository using the files provided:

    • tokenizer_config.json
    • tokenizer.json (the fast tokenizer graph)
    • special_tokens_map.json
    • tokenizer.model (if used for SentencePiece or BPE)
  • This should match the behavior of the following Python code, which works correctly:

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        "mistralai/Ministral-8B-Instruct-2410",
        trust_remote_code=True
    )
  • TGI should deserialize the tokenizer using the Hugging Face tokenizers Rust backend.

  • If the tokenizer format includes fields outside the expected schema, TGI should either:

    • Gracefully handle the mismatch (as transformers does), or
    • Emit a clearer error message and suggest workarounds (a fallback sketch follows this list)
  • The Rust-based deserializer in TGI currently fails with:

    Exception: data did not match any variant of untagged enum ModelWrapper at line 1217944 column 3
    
  • This likely originates from the tokenizers Rust crate where .json deserialization is handled via serde. The ModelWrapper enum expects well-defined variants such as BPE, WordPiece, Unigram, or WordLevel.

  • Python-based transformers uses flexible class instantiation with fallback logic and can tolerate schema drift or missing fields, which TGI cannot.

  • Since the model is public, loads fine in transformers, and includes all expected files, the user expects it to work in text-generation-inference as well — especially since TGI is positioned as a standard HF-hosted inference backend.
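
For illustration only (this is not TGI's actual code), a sketch of the kind of fallback requested above: try the Rust-backed fast tokenizer first and, if deserialization fails, emit a clearer hint and fall back to the Python path. Whether a slow tokenizer is available for this model is an assumption.

# Sketch of a graceful-fallback loader; names and messages are illustrative, not TGI APIs.
from transformers import AutoTokenizer

def load_tokenizer_with_fallback(model_id: str, token: str):
    try:
        # Fast path: Rust-backed tokenizer, the same deserializer TGI relies on.
        return AutoTokenizer.from_pretrained(model_id, token=token, use_fast=True)
    except Exception as e:
        print(
            f"Fast (Rust) tokenizer failed for {model_id}: {e}\n"
            "Hint: tokenizer.json may use a newer schema than the installed "
            "tokenizers library; try upgrading tokenizers or the TGI image."
        )
        # Fallback: Python (slow) tokenizer, assuming one exists for this model.
        return AutoTokenizer.from_pretrained(model_id, token=token, use_fast=False)

tokenizer = load_tokenizer_with_fallback(
    "mistralai/Ministral-8B-Instruct-2410", token="your_hf_token_here"
)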

@pavlonator pavlonator changed the title Tokenizer loading fails for mistralai/Ministral-8B-Instruct-2410 using TGI on GCP Vertex AI Tokenizer loading fails for mistralai/Mistral-8B-Instruct-2410 using TGI on GCP Vertex AI Apr 10, 2025
@pavlonator pavlonator changed the title Tokenizer loading fails for mistralai/Mistral-8B-Instruct-2410 using TGI on GCP Vertex AI Tokenizer loading fails for mistralai/Ministral-8B-Instruct-2410 using TGI on GCP Vertex AI Apr 10, 2025
pavlonator (Author) commented:

Is there anybody who can take a look?
