Extended queuing time when using guidance (structured generation) and grammar #3173


Open

jazken opened this issue Apr 15, 2025 · 0 comments


System Info

Running TGI in a Kubernetes pod with the following container configuration:

```
image: ghcr.io/huggingface/text-generation-inference:3.1.1
args: ["--model-id", "/models/llama3_3_70b_instruct", "--dtype", "bfloat16"]
```

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to reproduce the behavior:

  1. Use AsyncClient (text_generation) with generate and grammar parameters (a sketch of this call follows the schema below)
  2. Pass the grammar parameter {"type": "json", "value": Generation.schema()}
  3. Use the Pydantic schema shown below

```python
import uuid
from typing import List

from pydantic import BaseModel, HttpUrl

class Citation(BaseModel):
    chunk_id: uuid.UUID
    citation_id: int
    extract: str
    url: HttpUrl

class Sentence(BaseModel):
    content: str
    citation_ids: List[int]

class Paragraph(BaseModel):
    header: str
    sentences: List[Sentence]

class Generation(BaseModel):
    paragraphs: List[Paragraph]
    citations: List[Citation]
```
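
For reference, a minimal sketch of the call from steps 1–2, assuming the schema classes above are in scope. The endpoint URL, prompt, and token limit are placeholders, not values from the original setup:

```python
import asyncio

from text_generation import AsyncClient
from text_generation.types import Grammar, GrammarType

async def main():
    client = AsyncClient("http://localhost:8080")  # placeholder endpoint
    response = await client.generate(
        "Summarize the document with citations.",  # placeholder prompt
        max_new_tokens=512,                        # placeholder limit
        # Grammar-guided (structured) generation from the Pydantic schema above.
        grammar=Grammar(type=GrammarType.Json, value=Generation.schema()),
    )
    print(response.generated_text)

asyncio.run(main())
```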

Logs (screenshot in the original issue; the key line is reproduced below):

Key issue:

```
validation_time="5.669107ms" queue_time="161.454300714s"
```

Expected behavior

Expected the queue time to be similar to normal inference, or only slightly longer. Running the same container without guidance or structured generation gives:

```
validation_time="5.188854ms" queue_time="129.181µs"
```
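
The unguided baseline above corresponds to the same request with the grammar parameter omitted; a matching sketch, with the same placeholder endpoint and prompt as before:

```python
import asyncio

from text_generation import AsyncClient

async def main():
    client = AsyncClient("http://localhost:8080")  # placeholder endpoint
    # Identical request, but with the grammar parameter omitted; this is
    # the fast path whose queue_time is on the order of microseconds.
    response = await client.generate(
        "Summarize the document with citations.",  # placeholder prompt
        max_new_tokens=512,
    )
    print(response.generated_text)

asyncio.run(main())
```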
