docker compose:

```yaml
services:
  dense-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:turing-1.6
    container_name: dense-embed
    env_file: .env
    command: --model-id ${DENSE_MODEL_ID} --pooling cls
    ports:
      - "${DENSE_PORT:-8080}:80"
    volumes:
      - "./data:/data"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

/tokenize request:

```json
{"inputs": ["这是一个文本向量化的测试句子"]}
```

/tokenize response:

```json
[
  { "id": 0, "text": "<s>", "special": true, "start": null, "stop": null },
  { "id": 6, "text": "这是一", "special": false, "start": 0, "stop": 3 },
  { "id": 100013, "text": "这是一个文本向量化的测试", "special": false, "start": 0, "stop": 12 },
  { "id": 189061, "text": "句子", "special": false, "start": 12, "stop": 18 },
  { "id": 2110, "text": "", "special": false, "start": 18, "stop": 21 },
  { "id": 3272, "text": "", "special": false, "start": 21, "stop": 24 },
  { "id": 41904, "text": "", "special": false, "start": 24, "stop": 30 },
  { "id": 49125, "text": "", "special": false, "start": 30, "stop": 36 },
  { "id": 27683, "text": "", "special": false, "start": 36, "stop": 39 },
  { "id": 1344, "text": "", "special": false, "start": 39, "stop": 42 },
  { "id": 2, "text": "</s>", "special": true, "start": null, "stop": null }
]
```

tokenizer with the transformers API:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
encoded = tokenizer("这是一个文本向量化的测试句子")
tokenizer.convert_ids_to_tokens(encoded['input_ids'])
```

Output:

```
encoded: {'input_ids': [0, 6, 100013, 189061, 2110, 3272, 41904, 49125, 27683, 1344, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokens:  ['', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '']
```
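One plausible source of the empty and shifted token strings (an assumption on my side, not confirmed anywhere in this report) is mixing UTF-8 byte offsets with character offsets: the start/stop values in the response run from 0 to 42 for a 14-character input, which matches byte offsets for 3-byte CJK characters. A minimal sketch of the difference:

```python
# Sketch: byte offsets vs. character offsets when slicing the input text.
# The /tokenize response above reports spans like 36..39 for a single CJK
# token; those values only make sense as UTF-8 byte offsets (3 bytes per
# CJK character), not as Python character offsets.
text = "这是一个文本向量化的测试句子"   # 14 characters
raw = text.encode("utf-8")            # 42 bytes

# The character '句' occupies characters 12..13, i.e. bytes 36..39.
char_slice = text[36:39]                   # wrong: slices past the end -> ''
byte_slice = raw[36:39].decode("utf-8")    # correct: '句'

print(repr(char_slice), repr(byte_slice))
```

If the server computes spans in bytes but slices (or reports) them as character positions, every token after the first few CJK characters would come back empty, which matches the response above.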
Linked commits:

- 0eb4bdc: Fixing tokenization like huggingface/text-embeddings-inference#525
- 9a8d046 (#3156): Fixing tokenization like https://github.com/huggingface/text-embeddin…
System Info
```json
{
  "model_id": "/data/BAAI/bge-m3",
  "model_sha": null,
  "model_dtype": "float16",
  "model_type": {
    "embedding": {
      "pooling": "cls"
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 8192,
  "max_batch_tokens": 16384,
  "max_batch_requests": null,
  "max_client_batch_size": 32,
  "auto_truncate": false,
  "tokenization_workers": 48,
  "version": "1.6.0",
  "sha": "57d8fc8128ab94fcf06b4463ba0d83a4ca25f89b",
  "docker_label": "sha-57d8fc8"
}
```
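The fields of the /info response that matter for this report (which model, which pooling mode, which server version the bug was observed on) can be pulled out mechanically; a small sketch using the JSON quoted above verbatim:

```python
import json

# The /info response from the System Info section above, verbatim.
info = json.loads('''{ "model_id": "/data/BAAI/bge-m3", "model_sha": null,
  "model_dtype": "float16", "model_type": { "embedding": { "pooling": "cls" } },
  "max_concurrent_requests": 512, "max_input_length": 8192,
  "max_batch_tokens": 16384, "max_batch_requests": null,
  "max_client_batch_size": 32, "auto_truncate": false,
  "tokenization_workers": 48, "version": "1.6.0",
  "sha": "57d8fc8128ab94fcf06b4463ba0d83a4ca25f89b",
  "docker_label": "sha-57d8fc8" }''')

# Model, pooling mode, and server version relevant to this report.
print(info["model_id"], info["model_type"]["embedding"]["pooling"], info["version"])
```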
docker compose: see the compose file above.
Information
Tasks
Reproduction
1. model info: see the System Info above.
2. /tokenize: request and response shown above.
3. tokenizer with the transformers API: code and output shown above.

The token ids are OK, but the token strings are mismatched.
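The "ids OK, tokens mismatched" observation can be checked directly against the two outputs quoted in this report; a small sketch using those literal values:

```python
# Values copied verbatim from the /tokenize response and the transformers
# tokenizer output shown above.
tei_ids = [0, 6, 100013, 189061, 2110, 3272, 41904, 49125, 27683, 1344, 2]
tei_tokens = ["<s>", "这是一", "这是一个文本向量化的测试", "句子",
              "", "", "", "", "", "", "</s>"]

hf_ids = [0, 6, 100013, 189061, 2110, 3272, 41904, 49125, 27683, 1344, 2]
hf_tokens = ['', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '']

print(tei_ids == hf_ids)          # the id sequences agree
print(tei_tokens == hf_tokens)    # the token strings do not
```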
Expected behavior
The same tokens as the transformers result: ['', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '']