Extended queuing time when using guidance (structured generation) and grammar #3173


Open

jazken opened this issue Apr 15, 2025 · 0 comments


System Info

Running TGI in a Kubernetes pod with the following container configuration:

```
image: ghcr.io/huggingface/text-generation-inference:3.1.1
args: ["--model-id", "/models/llama3_3_70b_instruct", "--dtype", "bfloat16"]
```

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to reproduce the behavior:

  1. Use AsyncClient (text_generation) with generate and grammar parameters (a sketch of this call follows the schema below)
  2. Pass the grammar parameter {"type": "json", "value": Generation.schema()}
  3. Use the Pydantic schema shown below

```python
import uuid
from typing import List

from pydantic import BaseModel, HttpUrl

class Citation(BaseModel):
    chunk_id: uuid.UUID
    citation_id: int
    extract: str
    url: HttpUrl

class Sentence(BaseModel):
    content: str
    citation_ids: List[int]

class Paragraph(BaseModel):
    header: str
    sentences: List[Sentence]

class Generation(BaseModel):
    paragraphs: List[Paragraph]
    citations: List[Citation]
```
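
For reference, a minimal sketch of the call from steps 1–2, assuming the schema classes above are in scope. The endpoint URL, prompt, and token limit are placeholders, not values from the original setup:

```python
import asyncio

from text_generation import AsyncClient
from text_generation.types import Grammar, GrammarType

async def main():
    client = AsyncClient("http://localhost:8080")  # placeholder endpoint
    response = await client.generate(
        "Summarize the document with citations.",  # placeholder prompt
        max_new_tokens=512,                        # placeholder limit
        # Grammar-guided (structured) generation from the Pydantic schema above.
        grammar=Grammar(type=GrammarType.Json, value=Generation.schema()),
    )
    print(response.generated_text)

asyncio.run(main())
```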

Logs (screenshot in the original issue; the key line is reproduced below):

Key issue:

```
validation_time="5.669107ms" queue_time="161.454300714s"
```

Expected behavior

Expected the queue time to be similar to normal inference, or only slightly longer. Running the same container without guidance or structured generation gives:

```
validation_time="5.188854ms" queue_time="129.181µs"
```
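
The unguided baseline above corresponds to the same request with the grammar parameter omitted; a matching sketch, with the same placeholder endpoint and prompt as before:

```python
import asyncio

from text_generation import AsyncClient

async def main():
    client = AsyncClient("http://localhost:8080")  # placeholder endpoint
    # Identical request, but with the grammar parameter omitted; this is
    # the fast path whose queue_time is on the order of microseconds.
    response = await client.generate(
        "Summarize the document with citations.",  # placeholder prompt
        max_new_tokens=512,
    )
    print(response.generated_text)

asyncio.run(main())
```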
