refacto prompt building #709

Open · wants to merge 47 commits into base: main

Conversation

@NathanHB (Member) commented on May 7, 2025:

What does this PR do?

This PR gives the prompt-building logic in lighteval a much-needed spring cleaning.

The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥

Highlights

  • Prompt Manager Overhaul: Each model now owns its own PromptManager instance, with custom params for every flavor of prompt (multimodal, API, multiturn, you name it).
  • Metrics Slimdown: Metrics now only care about the SamplingMethod (generative or loglikelihood). Say goodbye to use_case and all those old request types.
  • Request Layer Gone: Models get the raw Doc directly; no more unnecessary request wrappers bloating the code.
  • Unified ModelResponse: All models return a single ModelResponse type, whether generative or loglikelihood. This means simpler logging and metric computation.
  • Consistent Metric Signatures: Every metric now uses the same function signature: compute(doc: Doc, model_response: ModelResponse) (see the sketch after this list).
  • Standardized Details: Each sample’s details now always include three fields: doc, metric, and model_response.
  • Generative Metrics Unified: All generative metrics now work the same way. If users want greedy generation, they need to set temperature to 0.
  • Removed Loglikelihood Single Token: this request type was bloated and almost never used.
  • Tests: All tests pass, and no changes were needed to expected values.
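
As a rough sketch of the unified signature described above (the import paths and the Doc/ModelResponse attributes used here are assumptions for illustration, not the PR's final API), a sample-level metric could look like:

```python
# Hypothetical sketch: import paths and attribute names are assumptions.
from lighteval.models.model_output import ModelResponse
from lighteval.tasks.requests import Doc


def exact_match(doc: Doc, model_response: ModelResponse) -> float:
    """Sample-level metric using the unified compute(doc, model_response) signature."""
    prediction = model_response.text[0].strip()  # first generated completion
    golds = doc.get_golds()  # assumed helper returning the gold answers as strings
    return float(any(prediction == gold.strip() for gold in golds))
```

Because every metric shares this shape, each sample's details can store doc, metric, and model_response side by side.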

Why?

  • Less code, fewer headaches.
  • Easier to add new benchmarks (including weird and wonderful ones).
  • More user-friendly inspection tools.
  • A single, unified way to handle prompts, responses, and metrics.

[Screenshot: 2025-06-16 at 18:13:51]

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

    dataloader, desc="Greedy generation", position=1, leave=True, disable=self.disable_tqdm
):
    batch_inputs = batch_inputs.to(self.device)
    if self.torch_dtype is not None:
        batch_inputs = batch_inputs.to(self.torch_dtype)

    max_new_tokens = self.config.generation_size or batch_requests[0].generation_size
    do_sample = batch_requests[0].do_sample
Member Author (@NathanHB):

Whether we want to use sampling or greedy decoding is left to the user.

Member:

isn't there a default value for the param though?

Member Author (@NathanHB):

The default value is True. Actually, after the discussion with @lewtun, this param is not needed anymore: it is controlled by the temperature arg (which defaults to 0), and if the user wants sampling, they have to set temperature > 0.

@@ -650,7 +629,7 @@ def _generate(
     max_new_tokens=max_new_tokens,
     pad_token_id=self.tokenizer.pad_token_id if self.tokenizer.pad_token_id else self.tokenizer.eos_token_id,
     eos_token_id=self.tokenizer.eos_token_id,
-    do_sample=do_sample,
+    do_sample=do_sample if generation_config.get("temperature", 1.0) > 0 else False,
Member Author (@NathanHB):

do_sample will always be true, except if the user sets temp to 0

Member Author (@NathanHB):

which is the default case when temp is not provided
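
In other words, sampling is gated purely by temperature. A standalone sketch of that rule (the helper name and the zero default below are illustrative, mirroring the discussion rather than the exact code):

```python
def resolve_do_sample(generation_config: dict, do_sample: bool = True) -> bool:
    """Hypothetical helper: sampling is only enabled when temperature > 0."""
    return do_sample if generation_config.get("temperature", 0.0) > 0 else False


# With the discussed default (temperature = 0) we fall back to greedy decoding;
# setting temperature > 0 re-enables sampling.
assert resolve_do_sample({}) is False
assert resolve_do_sample({"temperature": 0.7}) is True
```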

@NathanHB requested a review from Copilot on June 16, 2025 at 16:25
@Copilot (Contributor) left a comment:

Pull Request Overview

This PR overhauls prompt building and metrics handling across lighteval by removing legacy wrappers, unifying on the Doc and ModelResponse types, and standardizing metric signatures to use SamplingMethod.

  • Metrics now accept (doc: Doc, model_response: ModelResponse) and use SamplingMethod instead of legacy MetricCategory/MetricUseCase.
  • The global apply_... functions in metrics/__init__.py are consolidated into a single apply_metric that handles batched and per-sample metrics.
  • Data loading, logging, and documentation are updated to use the new Doc model, drop old request classes, and reflect the simplified API.
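
As a rough illustration of that consolidated flow (purely a sketch; names such as batched_compute and metric_name are assumptions, not lighteval's actual API):

```python
# Illustrative only: the real apply_metric in lighteval may differ in its
# arguments, batching behaviour, and return structure.
from typing import Any


def apply_metric(docs: list, responses: list, metrics: list) -> list[dict[str, Any]]:
    """Compute every metric for every (doc, model_response) pair."""
    results: list[dict[str, Any]] = [{} for _ in docs]
    for metric in metrics:
        if getattr(metric, "batched_compute", False):
            # Batched flow: one call over the whole set of samples.
            scores = metric.compute(docs=docs, model_responses=responses)
            for result, score in zip(results, scores):
                result[metric.metric_name] = score
        else:
            # Per-sample flow: the unified compute(doc, model_response) signature.
            for result, doc, response in zip(results, docs, responses):
                result[metric.metric_name] = metric.compute(doc=doc, model_response=response)
    return results
```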

Reviewed Changes

Copilot reviewed 67 out of 67 changed files in this pull request and generated no comments.

Summary per file:

  • src/lighteval/metrics/harness_compatibility/drop.py: Update DROP metric to use Doc and ModelResponse
  • src/lighteval/metrics/dynamic_metrics.py: Replace MetricCategory with SamplingMethod categories
  • src/lighteval/metrics/__init__.py: Consolidate apply functions into a unified apply_metric
  • src/lighteval/main_endpoint.py: Minor docstring update
  • src/lighteval/logging/info_loggers.py: Rewrite the Detail dataclass to hold doc and model_response
  • src/lighteval/logging/evaluation_tracker.py: Add preview_outputs using the new Detail fields
  • src/lighteval/data.py: Swap legacy request types for Doc and update type hints
  • pyproject.toml: Broaden the pytest version constraint
  • examples/model_configs/vllm_model_config.yaml: Add is_async parameter
  • examples/model_configs/transformers_vlm_model.yaml: Enable use_fast_image_processor, set temperature: 0.0
  • examples/model_configs/transformers_model.yaml: Set temperature: 0.0
  • examples/model_configs/sglang_model_config.yaml: Toggle use_chat_template: True
  • examples/custom_tasks_tests.py: Fix parameter name from metric to metrics
  • docs/source/saving-and-reading-results.mdx: Update detail file columns to __doc__, __model_response__, __metric__
  • docs/source/quicktour.mdx: Refresh the backend list with new endpoint names
  • docs/source/package_reference/tasks.mdx: Remove old request classes, add Doc
  • docs/source/package_reference/models.mdx: Revise to "Model Configs", update model config entries
  • docs/source/adding-a-new-metric.mdx: Show the updated metric signature using Doc/ModelResponse
  • docs/source/adding-a-custom-task.mdx: Rename metric to metrics, update parameter names
  • docs/source/_toctree.yml: Rename "Models and ModelConfigs" to "Model Configs"

Comments suppressed due to low confidence (4)

src/lighteval/data.py:258

  • This sorting criterion uses the character length of doc.query instead of the token length. It can misorder batches. Consider using the tokenized context length (e.g., len(doc.tokenized_context)).
-            -(len(query) + gen_length),
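
A possible fix along these lines (attribute names such as tokenized_context and generation_size are assumptions, not necessarily the Doc fields in this PR):

```python
# Hypothetical sketch: sort by tokenized context length rather than character
# length so that batches are grouped by their true token count.
def sort_docs_for_batching(docs: list) -> list:
    return sorted(
        docs,
        key=lambda doc: -(len(doc.tokenized_context) + doc.generation_size),
    )
```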

src/lighteval/logging/info_loggers.py:211

  • The attribute model_response.input may not exist on ModelResponse. Verify the correct property (e.g., model_response.input_tokens or model_response.text).
pprint(model_response.input)

src/lighteval/metrics/__init__.py:31

  • [nitpick] The nested loops over metrics and docs in apply_metric are hard to follow. Consider separating batched and per-sample flows into clear helper functions to improve readability.
for metric in metrics:

docs/source/adding-a-custom-task.mdx:56

  • The placeholder {GENERATIVE,LOGPROBS} is invalid syntax. Recommend specifying one method, e.g. SamplingMethod.GENERATIVE, or document how to pass multiple.
category=SamplingMethod.{GENERATIVE,LOGPROBS},

@clefourrier (Member):

Good to review?

@clefourrier (Member) left a comment:

Much neater, very nice refacto of a lot of old hanging code!
Lots of very cool simplifications!

Some questions:

  • you removed a range of return types in function signatures and I'm not clear why
  • how can one now select a custom batch size? (override_bs param before)
  • what do you get on aime24 atm? (I'm still not sure how you manage evals where you get both sampling generative and greedy generative metrics)

@@ -41,7 +41,7 @@ def prompt_fn(line, task_name: str = None):
         query=line["question"],
         choices=[f" {c}" for c in line["choices"]],
         gold_index=line["gold"],
-        instruction="",
+        system_prompt="",
Member:

I would add both instruction and system prompt, and explain the difference between a task prompt and a system prompt.
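
For illustration, a prompt function carrying both fields might look like the sketch below (whether Doc keeps a separate instruction field alongside system_prompt is exactly what this thread discusses; the prompt wording is made up):

```python
from lighteval.tasks.requests import Doc


def prompt_fn(line, task_name: str = None):
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["gold"],
        # Task prompt: instructions tied to this benchmark's format.
        instruction="Answer the following multiple-choice question.",
        # System prompt: model-level behaviour, injected via the chat template.
        system_prompt="You are a helpful assistant.",
    )
```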

## Model
### LightevalModel
[[autodoc]] models.abstract_model.LightevalModel
The model configs are used to define the model and its parameters. All the parameters can be
Member:

No more direct Model creation?

Member Author (@NathanHB):

wdym?

Member:

we used to reference the model class in the docs, now only the model configs: do we want the entry point to be the model config classes?
The only use case where this might be tricky is for people using models that are already loaded in memory.

Member Author (@NathanHB):

Ah yes, it would make more sense to have the actual models on another page imo.

- `openai`: evaluate models on one or more GPUs using [🔗 OpenAI API](https://platform.openai.com/)
- `inference-endpoint`: evaluate models using the [Inference Endpoint](https://huggingface.co/inference-endpoints/dedicated) API
- `tgi`: evaluate models using [🔗 Text Generation Inference](https://huggingface.co/docs/text-generation-inference/en/index)
Member:

maybe explain that it's a tgi server running locally?

Comment on lines 334 to 335
doc.do_sample = True
doc.use_logits = True
Member:

Why?

Member:

This is notably incorrect for greedy eval, in which case do_sample = False.

Member Author (@NathanHB):

greedy evals don't really make sense imo; evals are generative and can be greedy if the user wants them to be

Member:

Sorry, I was unclear: I meant that I don't see why do_sample should be true by default, as it will be incorrect for both greedy generative and logprob-based evals.

Member Author (@NathanHB):

yeah you are right, loglikelihood actually does not use this parameter, and greedy generatives will not look at it either.
tbh I feel like it can be removed

Member:

oh, it very likely can, since we now manage sampling at the model level rather than the doc level. It was only needed for sorting (we grouped evals of a similar type together), but I'm unsure we still need that, as it should likely be managed differently (at the model level?)

Member Author (@NathanHB):

yep, exactly!

Member Author (@NathanHB):

we can now just group using sampler number

@clefourrier (Member) left a comment:

skimmed, looks like most important things are fixed - again super good job
