
tokenize route got mismatch tokens #525

Closed
1 of 4 tasks
franklucky001 opened this issue Mar 24, 2025 · 0 comments · Fixed by #576

System Info

{
  "model_id": "/data/BAAI/bge-m3",
  "model_sha": null,
  "model_dtype": "float16",
  "model_type": {
    "embedding": {
      "pooling": "cls"
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 8192,
  "max_batch_tokens": 16384,
  "max_batch_requests": null,
  "max_client_batch_size": 32,
  "auto_truncate": false,
  "tokenization_workers": 48,
  "version": "1.6.0",
  "sha": "57d8fc8128ab94fcf06b4463ba0d83a4ca25f89b",
  "docker_label": "sha-57d8fc8"
}

Docker Compose file

services:
  dense-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:turing-1.6
    container_name: dense-embed
    env_file: .env
    command: --model-id ${DENSE_MODEL_ID}  --pooling cls
    ports:
      - "${DENSE_PORT:-8080}:80"
    volumes: 
      - "./data:/data"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Model info

  • BAAI/bge-m3

Request to the /tokenize route:

{"inputs": ["这是一个文本向量化的测试句子"]}
[
    {
        "id": 0,
        "text": "<s>",
        "special": true,
        "start": null,
        "stop": null
    },
    {
        "id": 6,
        "text": "这是一",
        "special": false,
        "start": 0,
        "stop": 3
    },
    {
        "id": 100013,
        "text": "这是一个文本向量化的测试",
        "special": false,
        "start": 0,
        "stop": 12
    },
    {
        "id": 189061,
        "text": "句子",
        "special": false,
        "start": 12,
        "stop": 18
    },
    {
        "id": 2110,
        "text": "",
        "special": false,
        "start": 18,
        "stop": 21
    },
    {
        "id": 3272,
        "text": "",
        "special": false,
        "start": 21,
        "stop": 24
    },
    {
        "id": 41904,
        "text": "",
        "special": false,
        "start": 24,
        "stop": 30
    },
    {
        "id": 49125,
        "text": "",
        "special": false,
        "start": 30,
        "stop": 36
    },
    {
        "id": 27683,
        "text": "",
        "special": false,
        "start": 36,
        "stop": 39
    },
    {
        "id": 1344,
        "text": "",
        "special": false,
        "start": 39,
        "stop": 42
    },
    {
        "id": 2,
        "text": "</s>",
        "special": true,
        "start": null,
        "stop": null
    }
]

Tokenizing with the transformers API

from transformers import AutoTokenizer

# Load the same tokenizer the server uses and tokenize the same sentence.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
encoded = tokenizer("这是一个文本向量化的测试句子")
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
  • encoded: {'input_ids': [0, 6, 100013, 189061, 2110, 3272, 41904, 49125, 27683, 1344, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
  • tokens: ['', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '']

The token IDs match, but the returned token texts do not: some are merged into longer spans and others come back empty.
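A minimal sketch that puts the two outputs side by side to make the mismatch visible (the host/port are assumptions based on the compose file above, and the response unwrapping is an assumption in case the route nests one token list per input):

import requests
from transformers import AutoTokenizer

sentence = "这是一个文本向量化的测试句子"

# Token texts reported by the TEI /tokenize route.
payload = requests.post(
    "http://localhost:8080/tokenize",
    json={"inputs": [sentence]},
).json()
# Unwrap if the route returns one token list per input.
token_objs = payload[0] if payload and isinstance(payload[0], list) else payload
route_ids = [t["id"] for t in token_objs]
route_texts = [t["text"] for t in token_objs]

# Reference tokens from the transformers tokenizer.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
ref_ids = tokenizer(sentence)["input_ids"]
ref_tokens = tokenizer.convert_ids_to_tokens(ref_ids)

print("ids match:", route_ids == ref_ids)
for rid, route_text, ref_tok in zip(route_ids, route_texts, ref_tokens):
    flag = "" if route_text == ref_tok else "   <-- mismatch"
    print(f"{rid:>8}  {route_text!r:<24} {ref_tok!r}{flag}")

With the output shown above, the IDs line up but most of the text fields are flagged as mismatched.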

Expected behavior

The token texts should match the transformers result: ['', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '']

Narsil added a commit to huggingface/text-generation-inference that referenced this issue Apr 9, 2025