
Differential mode for llama-bench + plotting code #13408


Open · JohannesGaessler opened this issue May 9, 2025 · 12 comments
Labels: enhancement (New feature or request)

@JohannesGaessler
Collaborator

I think it would be useful if there were a way to more easily compare the outputs of llama-bench as a function of context size, and I would therefore like to implement such a feature. What I'm imagining is something like a --differential flag which, when set, provides separate numbers for each individual model evaluation in a benchmark run instead of one number for all evaluations as a whole.

So for example with ./llama-bench -r 1 -d 1024 -n 4 -p 64 -ub 16 --differential I'm imagining something like this:

| model         | size       | params     | backend    | ngl | n_ubatch | test            | t/s                  |
| ------------- | ---------: | ---------: | ---------- | --: | -------: | --------------: | -------------------: |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1024    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1040    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1056    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1072    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1024     | 115.22 ± 0.00        |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1025     | 115.22 ± 0.00        |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1026     | 115.22 ± 0.00        |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1027     | 115.22 ± 0.00        |

You could in principle already get something like this by invoking llama-bench multiple times, but that is inconvenient.

Because reading differential data from a table is difficult, I would also add code to plot the t/s as a function of depth using matplotlib. I would add plotting code to compare-llama-bench.py, but because people often just want the performance for a single commit, I would also add a simplified plotting script that reads in one or more CSV tables and plots the contents of all of them in a single figure.
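As a rough illustration of what the simplified script could look like (the CSV column names "test" and "avg_ts" and the "@ d<depth>" test-name convention are just assumptions for this sketch, not a final format):

```python
#!/usr/bin/env python3
# Minimal sketch: read one or more llama-bench CSV tables and plot t/s vs. depth.
import csv
import sys

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for path in sys.argv[1:]:
    depths, tps = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Assumed layout: test names like "pp64 @ d1024", throughput in "avg_ts".
            if "@ d" not in row["test"]:
                continue
            depths.append(int(row["test"].split("@ d")[1]))
            tps.append(float(row["avg_ts"]))
    ax.plot(depths, tps, marker=".", label=path)

ax.set_xlabel("context depth [tokens]")
ax.set_ylabel("t/s")
ax.legend()
plt.savefig("llama-bench.png")
```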

It would in principle also be possible to add code for fitting a polynomial to the runtime as a function of depth, but doing this correctly is non-trivial. It could be done with comparatively low effort by using kafe2, but that project is licensed under the GPLv3. In any case, I think a statistical analysis of the performance differences (and how they would extrapolate to higher context sizes) will not be needed anyway, so it makes more sense to keep it simple and just do a plot.

@slaren since you are probably the biggest stakeholder for llama-bench, does the feature as described here sound useful to you? Do you have suggestions for changes?

@JohannesGaessler JohannesGaessler self-assigned this May 9, 2025
@JohannesGaessler JohannesGaessler added the enhancement New feature or request label May 9, 2025
@slaren
Member

slaren commented May 9, 2025

The idea looks good to me, but I don't quite understand what --differential would do based on that example. Maybe the command line parser could be extended to accept ranges instead; then you could do for example -d 1024-1030 and it would be equivalent to -d 1024,1025,1026,...,1030. Maybe it could also have an additional parameter to specify the step size.
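Roughly, the expansion I have in mind would be something like this sketch (the "+STEP" suffix is just one possible syntax for the step size, not something that exists yet; the real parser would live in the C++ argument handling of llama-bench):

```python
# Sketch of the proposed range expansion: "1024-1030" -> 1024,1025,...,1030,
# optionally with a step size, e.g. "1024-1088+16" -> 1024,1040,...,1088.
def expand_arg(arg: str) -> list[int]:
    values = []
    for part in arg.split(","):
        if "-" in part:
            rng, _, step = part.partition("+")
            first, last = (int(x) for x in rng.split("-"))
            values.extend(range(first, last + 1, int(step) if step else 1))
        else:
            values.append(int(part))
    return values

assert expand_arg("1024-1030") == list(range(1024, 1031))
```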

@JohannesGaessler
Collaborator Author

The effect of --differential would be to split single benchmark runs into multiple runs at increasing depths. The step size of the depth is the batch size with which the model is evaluated. In terms of implementation it would track one timing for each depth in the benchmark run instead of one timing for the entire benchmark run. So ./llama-bench -p 512 -ub 128,256 --differential would be equivalent to two runs with ./llama-bench -p 512 -ub 128 -d 0,128,256,384 and ./llama-bench -p 512 -ub 256 -d 0,256. The way I think about it is that the current implementation is essentially averaging over the tokens in the benchmark run and that with --differential the underlying values that are being averaged are printed instead. For me personally it doesn't matter too much which interface is used because I can just wrap the binary in a bash script to get the behavior I want; I just want to make sure ahead of time that the interface of the binary is intuitive.
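For reference, such a wrapper would only have to compute the depth lists that --differential would use and pass them through the existing -d flag. A minimal sketch (in Python rather than bash; the flag values are taken from the example above, the binary path is assumed):

```python
#!/usr/bin/env python3
# Emulate the proposed --differential behavior for "-p 512 -ub 128,256" by
# generating the equivalent comma-separated depth lists and calling llama-bench.
import subprocess

n_prompt = 512
for n_ubatch in (128, 256):
    # e.g. n_ubatch=128 -> "0,128,256,384", n_ubatch=256 -> "0,256"
    depths = ",".join(str(d) for d in range(0, n_prompt, n_ubatch))
    subprocess.run(
        ["./llama-bench", "-p", str(n_prompt), "-ub", str(n_ubatch), "-d", depths],
        check=True,
    )
```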

@CISC
Collaborator

CISC commented May 9, 2025

How about Mermaid with multiple lines (multiple bars does not seem to be supported), or radar?

@slaren
Member

slaren commented May 9, 2025

Ok, now I understand what the behavior of --differential would be. To me this would not be very intuitive, and I am not convinced that using the batch size as the step size is always desirable. In any case, the idea of accepting ranges seemed useful enough to me in the general case, so I implemented it in #13410. If that works for you then I think we can close this now, but I am not opposed to adding a more specific feature if this doesn't work well for your use case.

@JohannesGaessler
Collaborator Author

> How about Mermaid with multiple lines (multiple bars does not seem to be supported), or radar?

Unless this is you volunteering to do the implementation and maintenance yourself I will use the tools I am familiar with, i.e. NumPy and Matplotlib.

> To me this would not be very intuitive, and I am not convinced that using the batch size as the step size is always desirable. In any case, the idea of accepting ranges seemed useful enough to me in the general case so I implemented it in #13410.

For now I'll extend the code in compare-llama-bench.py to support plotting as well as CSV and JSON inputs; we can worry about how to generate the inputs for the plots afterwards.

For what I'm doing specifically, it is usually the t/s as a function of batch size that is of interest because kernels should ideally perform well across all tensor shapes. A flag like --differential would be useful because the average t/s can be easily reconstructed from the individual values, so I would be able to do a single benchmark run and use the same data both for a table with the average over some context depth range and for a plot showing the t/s as a function of the context depth. But as I said, I can also solve this specific requirement of mine by writing a simple bash script. The only issue that I would currently have is that repeatedly running a small batch size at a high depth is too slow to be really feasible.
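To make the "easily reconstructed" part concrete: the combined rate is the total number of tokens divided by the total time, which can be recovered from the per-depth rates, e.g. with hypothetical numbers:

```python
# Hypothetical per-depth results for a pp512 run split into ubatches of 128 tokens:
# (tokens, t/s) at depths 0, 128, 256, 384.
runs = [(128, 1200.0), (128, 1150.0), (128, 1100.0), (128, 1050.0)]

total_tokens = sum(n for n, _ in runs)
total_time = sum(n / tps for n, tps in runs)  # time per ubatch, summed
print(f"combined: {total_tokens / total_time:.1f} t/s")  # ~1122.2 t/s, not the plain mean of 1125
```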

@slaren
Member

slaren commented May 9, 2025

> The only issue that I would currently have is that repeatedly running a small batch size at a high depth is too slow to be really feasible.

Yeah, this is not great. I think it should be possible to group all the tests so that they are performed together when possible without too many changes; I may give this a try later.

@CISC
Collaborator

CISC commented May 9, 2025

> > How about Mermaid with multiple lines (multiple bars does not seem to be supported), or radar?
>
> Unless this is you volunteering to do the implementation and maintenance yourself I will use the tools I am familiar with, i.e. NumPy and Matplotlib.

Yes, it was. :)

@JohannesGaessler
Collaborator Author

Alright, if you're the one to contribute a solution for plotting t/s data that is also fine with me.

@CISC
Collaborator

CISC commented May 10, 2025

> For now I'll extend the code in compare-llama-bench.py to support [...] CSV and JSON inputs [...].

@JohannesGaessler Are you still working on this?

@JohannesGaessler
Collaborator Author

I only worked on it for a few minutes before you said that you would work on plotting code; I stashed the changes but as of right now I'm working on other things.

@CISC
Collaborator

CISC commented May 10, 2025

> I only worked on it for a few minutes before you said that you would work on plotting code; I stashed the changes but as of right now I'm working on other things.

Ok, mind if I look into adding JSONL support as that does not require multiple files?

@JohannesGaessler
Collaborator Author

Sure, go ahead. I'll be available for questions and code review.
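For reference, the appeal of JSONL here is that each result is a single self-contained JSON object per line, so results from many runs can be appended to one file and read back with a few lines of Python (the file name below is just a placeholder):

```python
import json

# Minimal sketch of reading JSONL benchmark output: one JSON object per line.
results = []
with open("llama-bench.jsonl") as f:
    for line in f:
        if line.strip():
            results.append(json.loads(line))
```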
