refacto prompt building #709

Open · wants to merge 47 commits into base: main

Conversation

@NathanHB (Member) commented on May 7, 2025:

What does this PR do?

This PR gives the prompt-building logic in lighteval a much-needed spring cleaning.

The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥

Highlights

  • Prompt Manager Overhaul: Each model now owns its own PromptManager instance, with custom params for every flavor of prompt (multimodal, API, multiturn, you name it).
  • Metrics Slimdown: Metrics now only care about the SamplingMethod (generative or loglikelihood). Say goodbye to use_case and all those old request types.
  • Request Layer Gone: Models get the raw Doc directly; no more unnecessary request wrappers bloating the code.
  • Unified ModelResponse: All models return a single ModelResponse type, whether generative or loglikelihood. This means simpler logging and metric computation.
  • Consistent Metric Signatures: Every metric now uses the same function signature: compute(doc: Doc, model_response: ModelResponse) (see the sketch after this list).
  • Standardized Details: Each sample’s details now always include three fields: doc, metric, and model_response.
  • Generative Metrics Unified: All generative metrics now work the same way. If users want greedy generation, they need to set temperature to 0.
  • Removed Loglikelihood Single Token: this request type was bloated and almost never used.
  • Tests: All tests pass, and no changes were needed to expected values.
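
As a rough sketch of the unified signature described above (the import paths and the Doc/ModelResponse attributes used here are assumptions for illustration, not the PR's final API), a sample-level metric could look like:

```python
# Hypothetical sketch: import paths and attribute names are assumptions.
from lighteval.models.model_output import ModelResponse
from lighteval.tasks.requests import Doc


def exact_match(doc: Doc, model_response: ModelResponse) -> float:
    """Sample-level metric using the unified compute(doc, model_response) signature."""
    prediction = model_response.text[0].strip()  # first generated completion
    golds = doc.get_golds()  # assumed helper returning the gold answers as strings
    return float(any(prediction == gold.strip() for gold in golds))
```

Because every metric shares this shape, each sample's details can store doc, metric, and model_response side by side.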

Why?

  • Less code, fewer headaches.
  • Easier to add new benchmarks (including weird and wonderful ones).
  • More user-friendly inspection tools.
  • A single, unified way to handle prompts, responses, and metrics.

[Screenshot: 2025-06-16 at 18:13:51]

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

    dataloader, desc="Greedy generation", position=1, leave=True, disable=self.disable_tqdm
):
    batch_inputs = batch_inputs.to(self.device)
    if self.torch_dtype is not None:
        batch_inputs = batch_inputs.to(self.torch_dtype)

    max_new_tokens = self.config.generation_size or batch_requests[0].generation_size
    do_sample = batch_requests[0].do_sample
Member Author (@NathanHB):

Whether we want to use sampling or greedy decoding is left to the user.

Member:

isn't there a default value for the param though?

Member Author (@NathanHB):

The default value is True. Actually, after the discussion with @lewtun, this param is not needed anymore: it is controlled by the temperature arg (which defaults to 0), and if the user wants sampling, they have to set temperature > 0.

@@ -650,7 +629,7 @@ def _generate(
     max_new_tokens=max_new_tokens,
     pad_token_id=self.tokenizer.pad_token_id if self.tokenizer.pad_token_id else self.tokenizer.eos_token_id,
     eos_token_id=self.tokenizer.eos_token_id,
-    do_sample=do_sample,
+    do_sample=do_sample if generation_config.get("temperature", 1.0) > 0 else False,
Member Author (@NathanHB):

do_sample will always be true, except if the user sets temp to 0

Member Author (@NathanHB):

which is the default case when temp is not provided
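
In other words, sampling is gated purely by temperature. A standalone sketch of that rule (the helper name and the zero default below are illustrative, mirroring the discussion rather than the exact code):

```python
def resolve_do_sample(generation_config: dict, do_sample: bool = True) -> bool:
    """Hypothetical helper: sampling is only enabled when temperature > 0."""
    return do_sample if generation_config.get("temperature", 0.0) > 0 else False


# With the discussed default (temperature = 0) we fall back to greedy decoding;
# setting temperature > 0 re-enables sampling.
assert resolve_do_sample({}) is False
assert resolve_do_sample({"temperature": 0.7}) is True
```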

@NathanHB requested a review from Copilot on June 16, 2025 at 16:25
@Copilot (Contributor) left a comment:

Pull Request Overview

This PR overhauls prompt building and metrics handling across lighteval by removing legacy wrappers, unifying on the Doc and ModelResponse types, and standardizing metric signatures to use SamplingMethod.

  • Metrics now accept (doc: Doc, model_response: ModelResponse) and use SamplingMethod instead of legacy MetricCategory/MetricUseCase.
  • The global apply_... functions in metrics/__init__.py are consolidated into a single apply_metric that handles batched and per-sample metrics.
  • Data loading, logging, and documentation are updated to use the new Doc model, drop old request classes, and reflect the simplified API.
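
As a rough illustration of that consolidated flow (purely a sketch; names such as batched_compute and metric_name are assumptions, not lighteval's actual API):

```python
# Illustrative only: the real apply_metric in lighteval may differ in its
# arguments, batching behaviour, and return structure.
from typing import Any


def apply_metric(docs: list, responses: list, metrics: list) -> list[dict[str, Any]]:
    """Compute every metric for every (doc, model_response) pair."""
    results: list[dict[str, Any]] = [{} for _ in docs]
    for metric in metrics:
        if getattr(metric, "batched_compute", False):
            # Batched flow: one call over the whole set of samples.
            scores = metric.compute(docs=docs, model_responses=responses)
            for result, score in zip(results, scores):
                result[metric.metric_name] = score
        else:
            # Per-sample flow: the unified compute(doc, model_response) signature.
            for result, doc, response in zip(results, docs, responses):
                result[metric.metric_name] = metric.compute(doc=doc, model_response=response)
    return results
```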

Reviewed Changes

Copilot reviewed 67 out of 67 changed files in this pull request and generated no comments.

Summary per file:

  • src/lighteval/metrics/harness_compatibility/drop.py: Update DROP metric to use Doc and ModelResponse
  • src/lighteval/metrics/dynamic_metrics.py: Replace MetricCategory with SamplingMethod categories
  • src/lighteval/metrics/__init__.py: Consolidate apply functions into a unified apply_metric
  • src/lighteval/main_endpoint.py: Minor docstring update
  • src/lighteval/logging/info_loggers.py: Rewrite the Detail dataclass to hold doc and model_response
  • src/lighteval/logging/evaluation_tracker.py: Add preview_outputs using the new Detail fields
  • src/lighteval/data.py: Swap legacy request types for Doc and update type hints
  • pyproject.toml: Broaden the pytest version constraint
  • examples/model_configs/vllm_model_config.yaml: Add is_async parameter
  • examples/model_configs/transformers_vlm_model.yaml: Enable use_fast_image_processor, set temperature: 0.0
  • examples/model_configs/transformers_model.yaml: Set temperature: 0.0
  • examples/model_configs/sglang_model_config.yaml: Toggle use_chat_template: True
  • examples/custom_tasks_tests.py: Fix parameter name from metric to metrics
  • docs/source/saving-and-reading-results.mdx: Update detail file columns to __doc__, __model_response__, __metric__
  • docs/source/quicktour.mdx: Refresh the backend list with new endpoint names
  • docs/source/package_reference/tasks.mdx: Remove old request classes, add Doc
  • docs/source/package_reference/models.mdx: Revise to "Model Configs", update model config entries
  • docs/source/adding-a-new-metric.mdx: Show the updated metric signature using Doc/ModelResponse
  • docs/source/adding-a-custom-task.mdx: Rename metric to metrics, update parameter names
  • docs/source/_toctree.yml: Rename "Models and ModelConfigs" to "Model Configs"

Comments suppressed due to low confidence (4)

src/lighteval/data.py:258

  • This sorting criterion uses the character length of doc.query instead of the token length. It can misorder batches. Consider using the tokenized context length (e.g., len(doc.tokenized_context)).
-            -(len(query) + gen_length),
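
A possible fix along these lines (attribute names such as tokenized_context and generation_size are assumptions, not necessarily the Doc fields in this PR):

```python
# Hypothetical sketch: sort by tokenized context length rather than character
# length so that batches are grouped by their true token count.
def sort_docs_for_batching(docs: list) -> list:
    return sorted(
        docs,
        key=lambda doc: -(len(doc.tokenized_context) + doc.generation_size),
    )
```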

src/lighteval/logging/info_loggers.py:211

  • The attribute model_response.input may not exist on ModelResponse. Verify the correct property (e.g., model_response.input_tokens or model_response.text).
pprint(model_response.input)

src/lighteval/metrics/__init__.py:31

  • [nitpick] The nested loops over metrics and docs in apply_metric are hard to follow. Consider separating batched and per-sample flows into clear helper functions to improve readability.
for metric in metrics:

docs/source/adding-a-custom-task.mdx:56

  • The placeholder {GENERATIVE,LOGPROBS} is invalid syntax. Recommend specifying one method, e.g. SamplingMethod.GENERATIVE, or document how to pass multiple.
category=SamplingMethod.{GENERATIVE,LOGPROBS},

@clefourrier (Member):

Good to review?

@clefourrier (Member) left a comment:

Much neater, very nice refacto of a lot of old hanging code!
Lots of very cool simplifications!

Some questions:

  • you removed a range of return types in function signatures and I'm not clear why
  • how can one now select a custom batch size? (override_bs param before)
  • what do you get on aime24 atm? (I'm still not sure how you manage evals where you get both sampling generative and greedy generative metrics)

@@ -41,7 +41,7 @@ def prompt_fn(line, task_name: str = None):
         query=line["question"],
         choices=[f" {c}" for c in line["choices"]],
         gold_index=line["gold"],
-        instruction="",
+        system_prompt="",
Member:

I would add both instruction and system prompt, and explain the difference between a task prompt and a system prompt.
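
For illustration, a prompt function carrying both fields might look like the sketch below (whether Doc keeps a separate instruction field alongside system_prompt is exactly what this thread discusses; the prompt wording is made up):

```python
from lighteval.tasks.requests import Doc


def prompt_fn(line, task_name: str = None):
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["gold"],
        # Task prompt: instructions tied to this benchmark's format.
        instruction="Answer the following multiple-choice question.",
        # System prompt: model-level behaviour, injected via the chat template.
        system_prompt="You are a helpful assistant.",
    )
```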

## Model
### LightevalModel
[[autodoc]] models.abstract_model.LightevalModel
The model configs are used to define the model and its parameters. All the parameters can be
Member:

No more direct Model creation?

Member Author (@NathanHB):

wdym?

Member:

we used to reference the model class in the docs, now only the model configs: do we want the entry point to be the model config classes?
The only use case where this might be tricky is for people using models that are already loaded in memory.

Member Author (@NathanHB):

Ah yes, it would make more sense to have the actual models on another page imo.

- `openai`: evaluate models on one or more GPUs using [🔗 OpenAI API](https://platform.openai.com/)
- `inference-endpoint`: evaluate models using the [Inference Endpoint](https://huggingface.co/inference-endpoints/dedicated) API
- `tgi`: evaluate models using [🔗 Text Generation Inference](https://huggingface.co/docs/text-generation-inference/en/index)
Member:

maybe explain that it's a tgi server running locally?

Comment on lines 334 to 335
doc.do_sample = True
doc.use_logits = True
Member:

Why?

Member:

This is notably incorrect for greedy eval, in which case do_sample = False.

Member Author (@NathanHB):

greedy evals don't really make sense imo; evals are generative and can be greedy if the user wants them to be

Member:

Sorry, I was unclear: I meant that I don't see why do_sample should be true by default, as it will be incorrect for both greedy generative and logprob-based evals.

Member Author (@NathanHB):

yeah you are right, loglikelihood actually does not use this parameter, and greedy generatives will not look at it either.
tbh I feel like it can be removed

Member:

oh, it very likely can, since we now manage sampling at the model level rather than the doc level. It was only needed for sorting (we grouped evals of a similar type together), but I'm unsure we still need that, as it should likely be managed differently (at the model level?)

Member Author (@NathanHB):

yep, exactly!

Member Author (@NathanHB):

we can now just group using sampler number

@clefourrier (Member) left a comment:

skimmed, looks like most important things are fixed - again super good job
