
tokenize route got mismatch tokens #525

Closed
1 of 4 tasks
franklucky001 opened this issue Mar 24, 2025 · 0 comments · Fixed by #576

System Info

{
  "model_id": "/data/BAAI/bge-m3",
  "model_sha": null,
  "model_dtype": "float16",
  "model_type": {
    "embedding": {
      "pooling": "cls"
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 8192,
  "max_batch_tokens": 16384,
  "max_batch_requests": null,
  "max_client_batch_size": 32,
  "auto_truncate": false,
  "tokenization_workers": 48,
  "version": "1.6.0",
  "sha": "57d8fc8128ab94fcf06b4463ba0d83a4ca25f89b",
  "docker_label": "sha-57d8fc8"
}

Docker Compose file

services:
  dense-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:turing-1.6
    container_name: dense-embed
    env_file: .env
    command: --model-id ${DENSE_MODEL_ID}  --pooling cls
    ports:
      - "${DENSE_PORT:-8080}:80"
    volumes: 
      - "./data:/data"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Model info

  • BAAI/bge-m3

Request to the /tokenize route:

{"inputs": ["这是一个文本向量化的测试句子"]}
[
    {
        "id": 0,
        "text": "<s>",
        "special": true,
        "start": null,
        "stop": null
    },
    {
        "id": 6,
        "text": "这是一",
        "special": false,
        "start": 0,
        "stop": 3
    },
    {
        "id": 100013,
        "text": "这是一个文本向量化的测试",
        "special": false,
        "start": 0,
        "stop": 12
    },
    {
        "id": 189061,
        "text": "句子",
        "special": false,
        "start": 12,
        "stop": 18
    },
    {
        "id": 2110,
        "text": "",
        "special": false,
        "start": 18,
        "stop": 21
    },
    {
        "id": 3272,
        "text": "",
        "special": false,
        "start": 21,
        "stop": 24
    },
    {
        "id": 41904,
        "text": "",
        "special": false,
        "start": 24,
        "stop": 30
    },
    {
        "id": 49125,
        "text": "",
        "special": false,
        "start": 30,
        "stop": 36
    },
    {
        "id": 27683,
        "text": "",
        "special": false,
        "start": 36,
        "stop": 39
    },
    {
        "id": 1344,
        "text": "",
        "special": false,
        "start": 39,
        "stop": 42
    },
    {
        "id": 2,
        "text": "</s>",
        "special": true,
        "start": null,
        "stop": null
    }
]

Tokenizing with the transformers API

from transformers import AutoTokenizer

# Load the same tokenizer the server uses and tokenize the same sentence.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
encoded = tokenizer("这是一个文本向量化的测试句子")
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
  • encoded: {'input_ids': [0, 6, 100013, 189061, 2110, 3272, 41904, 49125, 27683, 1344, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
  • tokens: ['', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '']

The token IDs match, but the returned token texts do not: some are merged into longer spans and others come back empty.
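A minimal sketch that puts the two outputs side by side to make the mismatch visible (the host/port are assumptions based on the compose file above, and the response unwrapping is an assumption in case the route nests one token list per input):

import requests
from transformers import AutoTokenizer

sentence = "这是一个文本向量化的测试句子"

# Token texts reported by the TEI /tokenize route.
payload = requests.post(
    "http://localhost:8080/tokenize",
    json={"inputs": [sentence]},
).json()
# Unwrap if the route returns one token list per input.
token_objs = payload[0] if payload and isinstance(payload[0], list) else payload
route_ids = [t["id"] for t in token_objs]
route_texts = [t["text"] for t in token_objs]

# Reference tokens from the transformers tokenizer.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
ref_ids = tokenizer(sentence)["input_ids"]
ref_tokens = tokenizer.convert_ids_to_tokens(ref_ids)

print("ids match:", route_ids == ref_ids)
for rid, route_text, ref_tok in zip(route_ids, route_texts, ref_tokens):
    flag = "" if route_text == ref_tok else "   <-- mismatch"
    print(f"{rid:>8}  {route_text!r:<24} {ref_tok!r}{flag}")

With the output shown above, the IDs line up but most of the text fields are flagged as mismatched.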

Expected behavior

The token texts should match the transformers result: ['', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '']

Narsil added a commit to huggingface/text-generation-inference that referenced this issue Apr 9, 2025