refacto prompt building #709
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
    dataloader, desc="Greedy generation", position=1, leave=True, disable=self.disable_tqdm
):
    batch_inputs = batch_inputs.to(self.device)
    if self.torch_dtype is not None:
        batch_inputs = batch_inputs.to(self.torch_dtype)

    max_new_tokens = self.config.generation_size or batch_requests[0].generation_size
    do_sample = batch_requests[0].do_sample
whether we want to use sampling or greedy decoding is left to the user
isn't there a default value for the param though?
the default value is True. Actually, after the discussion with @lewtun, this param is not needed anymore: it is controlled by the temperature arg (defaults to 0), and if the user wants to use sampling, they have to set the temperature > 0
@@ -650,7 +629,7 @@ def _generate(
     max_new_tokens=max_new_tokens,
     pad_token_id=self.tokenizer.pad_token_id if self.tokenizer.pad_token_id else self.tokenizer.eos_token_id,
     eos_token_id=self.tokenizer.eos_token_id,
-    do_sample=do_sample,
+    do_sample=do_sample if generation_config.get("temperature", 1.0) > 0 else False,
do_sample will always be true, except if the user sets temp to 0
which is the default case when temp is not provided
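For readers skimming this thread, a minimal sketch of the resulting rule, assuming the example model configs default `temperature` to 0.0 (this is an illustration, not the exact lighteval code):

```python
# Minimal sketch: sampling is no longer a user-facing flag; it is implied by a
# positive temperature. With the example configs setting temperature to 0.0,
# generation stays greedy unless the user explicitly raises the temperature.
def resolve_do_sample(generation_config: dict) -> bool:
    return generation_config.get("temperature", 0.0) > 0
```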
Pull Request Overview

This PR overhauls prompt-building and metrics handling across lighteval by removing legacy wrappers, unifying on `Doc` and `ModelResponse` types, and standardizing metric signatures to use `SamplingMethod`.

- Metrics now accept `(doc: Doc, model_response: ModelResponse)` and use `SamplingMethod` instead of the legacy `MetricCategory`/`MetricUseCase` (see the sketch after this list).
- The global `apply_...` functions in `metrics/__init__.py` are consolidated into a single `apply_metric` that handles batched and per-sample metrics.
- Data loading, logging, and documentation are updated to use the new `Doc` model, drop old request classes, and reflect the simplified API.
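To make the new signature concrete, here is a hedged sketch of a sample-level metric under this API; the import paths and the `.text` field are assumptions based on the discussion below, not verified signatures:

```python
# Hedged sketch of a metric using the new (doc, model_response) signature.
from lighteval.tasks.requests import Doc                 # assumed import path
from lighteval.models.model_output import ModelResponse  # assumed import path

def exact_match_compute(doc: Doc, model_response: ModelResponse) -> float:
    """Return 1.0 if the generated text matches the gold choice exactly."""
    text = model_response.text  # assumed field holding the generation(s)
    prediction = text[0] if isinstance(text, (list, tuple)) else text
    gold = doc.choices[doc.gold_index]
    return float(prediction.strip() == gold.strip())
```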
Reviewed Changes
Copilot reviewed 67 out of 67 changed files in this pull request and generated no comments.
File | Description
---|---
src/lighteval/metrics/harness_compatibility/drop.py | Update DROP metric to use `Doc` and `ModelResponse`
src/lighteval/metrics/dynamic_metrics.py | Replace `MetricCategory` with `SamplingMethod` categories
src/lighteval/metrics/__init__.py | Consolidate apply functions into a unified `apply_metric`
src/lighteval/main_endpoint.py | Minor docstring update
src/lighteval/logging/info_loggers.py | Rewrite `Detail` dataclass to hold `doc` and `model_response`
src/lighteval/logging/evaluation_tracker.py | Add `preview_outputs` using new `Detail` fields
src/lighteval/data.py | Swap legacy request types for `Doc` and update type hints
pyproject.toml | Broaden pytest version constraint
examples/model_configs/vllm_model_config.yaml | Add `is_async` parameter
examples/model_configs/transformers_vlm_model.yaml | Enable `use_fast_image_processor`, set `temperature: 0.0`
examples/model_configs/transformers_model.yaml | Set `temperature: 0.0`
examples/model_configs/sglang_model_config.yaml | Toggle `use_chat_template: True`
examples/custom_tasks_tests.py | Fix parameter name from `metric` to `metrics`
docs/source/saving-and-reading-results.mdx | Update detail file columns to `__doc__`, `__model_response__`, `__metric__`
docs/source/quicktour.mdx | Refresh backend list with new endpoint names
docs/source/package_reference/tasks.mdx | Remove old request classes, add `Doc`
docs/source/package_reference/models.mdx | Revise to "Model Configs", update model config entries
docs/source/adding-a-new-metric.mdx | Show updated metric signature using `Doc`/`ModelResponse`
docs/source/adding-a-custom-task.mdx | Rename `metric` → `metrics`, update parameter names (see the sketch after this table)
docs/source/_toctree.yml | Rename "Models and ModelConfigs" to "Model Configs"
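As a reference point for the renamed parameter, a hedged sketch of a custom task definition; the import paths, dataset name, and metric choice are assumptions, only the `metrics` (plural) parameter and the `Doc` fields come from the changes listed above:

```python
# Hedged sketch of a custom task using the renamed `metrics` parameter.
from lighteval.metrics.metrics import Metrics                   # assumed import path
from lighteval.tasks.lighteval_task import LightevalTaskConfig  # assumed import path
from lighteval.tasks.requests import Doc                        # assumed import path

def prompt_fn(line, task_name: str = None):
    # Mirrors the docs example shown elsewhere in this PR.
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["gold"],
    )

my_task = LightevalTaskConfig(
    name="my_custom_task",            # hypothetical task name
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="my_org/my_dataset",      # hypothetical dataset
    hf_subset="default",
    metrics=[Metrics.exact_match],    # `metrics` (plural) replaces the old `metric`
)
```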
Comments suppressed due to low confidence (4)

src/lighteval/data.py:258
- This sorting criterion uses the character length of `doc.query` instead of the token length. It can misorder batches. Consider using the tokenized context length (e.g., `len(doc.tokenized_context)`); see the sketch after these comments.
  `-(len(query) + gen_length),`

src/lighteval/logging/info_loggers.py:211
- The attribute `model_response.input` may not exist on `ModelResponse`. Verify the correct property (e.g., `model_response.input_tokens` or `model_response.text`).
  `pprint(model_response.input)`

src/lighteval/metrics/__init__.py:31
- [nitpick] The nested loops over `metrics` and `docs` in `apply_metric` are hard to follow. Consider separating batched and per-sample flows into clear helper functions to improve readability.
  `for metric in metrics:`

docs/source/adding-a-custom-task.mdx:56
- The placeholder `{GENERATIVE,LOGPROBS}` is invalid syntax. Recommend specifying one method, e.g. `SamplingMethod.GENERATIVE`, or document how to pass multiple.
  `category=SamplingMethod.{GENERATIVE,LOGPROBS},`
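A possible fix for the first suppressed comment, sketched under the assumption that `Doc` keeps a `tokenized_context` attribute (illustrative only, not the PR's actual code):

```python
# Sort requests by tokenized context length plus expected generation length,
# instead of by the character length of doc.query.
def sorting_key(doc, gen_length: int) -> int:
    # Negative so that the longest requests come first in an ascending sort.
    return -(len(doc.tokenized_context) + gen_length)

# usage: docs.sort(key=lambda d: sorting_key(d, gen_length))
```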
Good to review?
Much neater, very nice refacto of a lot of old hanging code!
Lots of very cool simplifications!
Some questions:
- you removed a range of return types in function signatures and I'm not clear why
- how can one now select a custom batch size? (override_bs param before)
- what do you get on `aime24` atm? (I'm still not sure how you manage evals where you get both sampling generative and greedy generative metrics)
docs/source/adding-a-custom-task.mdx
Outdated
@@ -41,7 +41,7 @@ def prompt_fn(line, task_name: str = None):
     query=line["question"],
     choices=[f" {c}" for c in line["choices"]],
     gold_index=line["gold"],
-    instruction="",
+    system_prompt="",
I would add both instruction and system prompt and explain the diff between a task prompt and system prompt
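A hedged sketch of what that docs example could look like with both fields, assuming `Doc` exposes an `instruction` (task-level prompt) alongside `system_prompt` (chat system message); the import path and the example strings are illustrative:

```python
from lighteval.tasks.requests import Doc  # assumed import path

def prompt_fn(line, task_name: str = None):
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["gold"],
        # Task-level prompt, prepended to every sample of this task.
        instruction="Answer the following multiple-choice question.",
        # System message, used when the model is queried with a chat template.
        system_prompt="You are a helpful assistant.",
    )
```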
## Model
### LightevalModel
[[autodoc]] models.abstract_model.LightevalModel
The model configs are used to define the model and its parameters. All the parameters can be
No more direct Model creation?
wdym?
we used to refer to the model class in the doc, now only the model configs - do we want the entry point to be the model config classes?
The only use case where this might be tricky is for people using models already loaded in memory
AH yes, well it would make more sense to have the actual models on another page imo
- `openai`: evaluate models on one or more GPUs using [🔗 OpenAI API](https://platform.openai.com/)
- `inference-endpoint`: evaluate models using endpoint's API [Inference Endpoint](https://huggingface.co/inference-endpoints/dedicated)
- `tgi`: evaluate models using [🔗 Text Generation Inference](https://huggingface.co/docs/text-generation-inference/en/index)
maybe explain that it's a tgi server running locally?
doc.do_sample = True
doc.use_logits = True
Why?
notably incorrect if greedy eval, in which case do_sample = False
greedy evals don't really make sense imo, evals are generative and can be greedy if the user wants to
Sorry, I was unclear, I meant that by default I don't see why do_sample should be true as it will be incorrect both for greedy generative and logprobs based evals
yeah you are right, loglikelihood actually does not use this parameter; as for greedy generatives, it will not look at the param either.
tbh i feel like it can be removed
oh it actually very likely can since we now manage sampling at the model level rather than the doc level - it was just needed for sorting (we grouped evals of a similar type together) but I'm unsure we would still need this, as it likely should be managed differently (model level?)
yep exactly !
we can now just group using sampler number
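An illustrative sketch of that grouping, with hypothetical helper names (the actual lighteval implementation may differ):

```python
# Group docs by the sampling method of their task/metrics rather than by a
# per-doc do_sample flag, so each group can be run in a single pass.
from collections import defaultdict

def group_docs_by_sampling_method(docs, sampling_method_of):
    """`sampling_method_of` maps a doc to its SamplingMethod (hypothetical callable)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[sampling_method_of(doc)].append(doc)
    return groups
```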
skimmed, looks like most important things are fixed - again super good job
What does this PR do?
This PR gives the prompt-building logic in lighteval a much-needed spring cleaning.
The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥
Highlights
- Tasks and metrics are now categorized by `SamplingMethod` (generative or loglikelihood). Say goodbye to `use_case` and all those old request types.
- Models consume `Doc` directly: no more unnecessary `request` wrappers that were bloating the code.
- Model outputs are unified into a single `ModelResponse` type, whether generative or loglikelihood. This means simpler logging and metric computation.
- Metrics now implement `compute(doc: Doc, model_response: ModelResponse)`.
Why?