
Differential mode for llama-bench + plotting code #13408


Open · JohannesGaessler opened this issue May 9, 2025 · 12 comments
Labels: enhancement (New feature or request)

@JohannesGaessler
Collaborator

I think it would be useful if there were a way to more easily compare the outputs of llama-bench as a function of context size, and I would therefore like to implement such a feature. What I'm imagining is something like a --differential flag which, when set, provides separate numbers for each individual model evaluation in a benchmark run instead of one number for all evaluations as a whole.

So for example with ./llama-bench -r 1 -d 1024 -n 4 -p 64 -ub 16 --differential I'm imagining something like this:

| model         | size       | params     | backend    | ngl | n_ubatch | test            | t/s                  |
| ------------- | ---------: | ---------: | ---------- | --: | -------: | --------------: | -------------------: |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1024    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1040    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1056    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1072    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1024     | 115.22 ± 0.00        |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1025     | 115.22 ± 0.00        |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1026     | 115.22 ± 0.00        |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1027     | 115.22 ± 0.00        |

You could in principle already get something like this by invoking llama-bench multiple times, but that is inconvenient.

Because reading differential data from a table is difficult, I would also add code to plot the t/s as a function of depth using matplotlib. I would add plotting code to compare-llama-bench.py, but because people often just want the performance for a single commit, I would also add a simplified plotting script that reads in one or more CSV tables and plots the contents of all of them in a single figure.
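As a rough illustration of what the simplified script could look like (the CSV column names "test" and "avg_ts" and the "@ d<depth>" test-name convention are just assumptions for this sketch, not a final format):

```python
#!/usr/bin/env python3
# Minimal sketch: read one or more llama-bench CSV tables and plot t/s vs. depth.
import csv
import sys

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for path in sys.argv[1:]:
    depths, tps = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Assumed layout: test names like "pp64 @ d1024", throughput in "avg_ts".
            if "@ d" not in row["test"]:
                continue
            depths.append(int(row["test"].split("@ d")[1]))
            tps.append(float(row["avg_ts"]))
    ax.plot(depths, tps, marker=".", label=path)

ax.set_xlabel("context depth [tokens]")
ax.set_ylabel("t/s")
ax.legend()
plt.savefig("llama-bench.png")
```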

It would in principle also be possible to add code for fitting a polynomial to the runtime as a function of depth, but doing this correctly is non-trivial. It could be done with comparatively low effort by using kafe2, but that project is licensed under the GPLv3. In any case, I think a statistical analysis of the performance differences (and how they would extrapolate to higher context sizes) will not be needed anyway, so it makes more sense to keep it simple and just do a plot.

@slaren since you are probably the biggest stakeholder for llama-bench, does the feature as described here sound useful to you? Do you have suggestions for changes?

@JohannesGaessler JohannesGaessler self-assigned this May 9, 2025
@JohannesGaessler JohannesGaessler added the enhancement New feature or request label May 9, 2025
@slaren
Member

slaren commented May 9, 2025

The idea looks good to me, but I don't quite understand what --differential would do based on that example. Maybe the command line parser could be extended to accept ranges instead; then you could do for example -d 1024-1030 and it would be equivalent to -d 1024,1025,1026,...,1030. Maybe it could also have an additional parameter to specify the step size.
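Roughly, the expansion I have in mind would be something like this sketch (the "+STEP" suffix is just one possible syntax for the step size, not something that exists yet; the real parser would live in the C++ argument handling of llama-bench):

```python
# Sketch of the proposed range expansion: "1024-1030" -> 1024,1025,...,1030,
# optionally with a step size, e.g. "1024-1088+16" -> 1024,1040,...,1088.
def expand_arg(arg: str) -> list[int]:
    values = []
    for part in arg.split(","):
        if "-" in part:
            rng, _, step = part.partition("+")
            first, last = (int(x) for x in rng.split("-"))
            values.extend(range(first, last + 1, int(step) if step else 1))
        else:
            values.append(int(part))
    return values

assert expand_arg("1024-1030") == list(range(1024, 1031))
```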

@JohannesGaessler
Collaborator Author

The effect of --differential would be to split single benchmark runs into multiple runs at increasing depths. The step size of the depth is the batch size with which the model is evaluated. In terms of implementation it would track one timing for each depth in the benchmark run instead of one timing for the entire benchmark run. So ./llama-bench -p 512 -ub 128,256 --differential would be equivalent to two runs with ./llama-bench -p 512 -ub 128 -d 0,128,256,384 and ./llama-bench -p 512 -ub 256 -d 0,256. The way I think about it is that the current implementation is essentially averaging over the tokens in the benchmark run and that with --differential the underlying values that are being averaged are printed instead. For me personally it doesn't matter too much which interface is used because I can just wrap the binary in a bash script to get the behavior I want; I just want to make sure ahead of time that the interface of the binary is intuitive.
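For reference, such a wrapper would only have to compute the depth lists that --differential would use and pass them through the existing -d flag. A minimal sketch (in Python rather than bash; the flag values are taken from the example above, the binary path is assumed):

```python
#!/usr/bin/env python3
# Emulate the proposed --differential behavior for "-p 512 -ub 128,256" by
# generating the equivalent comma-separated depth lists and calling llama-bench.
import subprocess

n_prompt = 512
for n_ubatch in (128, 256):
    # e.g. n_ubatch=128 -> "0,128,256,384", n_ubatch=256 -> "0,256"
    depths = ",".join(str(d) for d in range(0, n_prompt, n_ubatch))
    subprocess.run(
        ["./llama-bench", "-p", str(n_prompt), "-ub", str(n_ubatch), "-d", depths],
        check=True,
    )
```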

@CISC
Collaborator

CISC commented May 9, 2025

How about Mermaid with multiple lines (multiple bars does not seem to be supported), or radar?

@slaren
Member

slaren commented May 9, 2025

Ok, now I understand what the behavior of --differential would be. To me this would not be very intuitive, and I am not convinced that using the batch size as the step size is always desirable. In any case, the idea of accepting ranges seemed useful enough to me in the general case, so I implemented it in #13410. If that works for you then I think we can close this now, but I am not opposed to adding a more specific feature if this doesn't work well for your use case.

@JohannesGaessler
Collaborator Author

> How about Mermaid with multiple lines (multiple bars does not seem to be supported), or radar?

Unless this is you volunteering to do the implementation and maintenance yourself I will use the tools I am familiar with, i.e. NumPy and Matplotlib.

> To me this would not be very intuitive, and I am not convinced that using the batch size as the step size is always desirable. In any case, the idea of accepting ranges seemed useful enough to me in the general case so I implemented it in #13410.

For now I'll extend the code in compare-llama-bench.py to support plotting as well as CSV and JSON inputs; we can worry about how to generate the inputs for the plots afterwards.

For what I'm doing specifically, it is usually the t/s as a function of batch size that is of interest because kernels should ideally perform well across all tensor shapes. A flag like --differential would be useful because the average t/s can be easily reconstructed from the individual values, so I would be able to do a single benchmark run and use the same data both for a table with the average over some context depth range and for a plot showing the t/s as a function of the context depth. But as I said, I can also solve this specific requirement of mine by writing a simple bash script. The only issue that I would currently have is that repeatedly running a small batch size at a high depth is too slow to be really feasible.
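To make the "easily reconstructed" part concrete: the combined rate is the total number of tokens divided by the total time, which can be recovered from the per-depth rates, e.g. with hypothetical numbers:

```python
# Hypothetical per-depth results for a pp512 run split into ubatches of 128 tokens:
# (tokens, t/s) at depths 0, 128, 256, 384.
runs = [(128, 1200.0), (128, 1150.0), (128, 1100.0), (128, 1050.0)]

total_tokens = sum(n for n, _ in runs)
total_time = sum(n / tps for n, tps in runs)  # time per ubatch, summed
print(f"combined: {total_tokens / total_time:.1f} t/s")  # ~1122.2 t/s, not the plain mean of 1125
```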

@slaren
Member

slaren commented May 9, 2025

> The only issue that I would currently have is that repeatedly running a small batch size at a high depth is too slow to be really feasible.

Yeah, this is not great. I think it should be possible to group all the tests so that they are performed together when possible without too many changes; I may give this a try later.

@CISC
Collaborator

CISC commented May 9, 2025

> > How about Mermaid with multiple lines (multiple bars does not seem to be supported), or radar?
>
> Unless this is you volunteering to do the implementation and maintenance yourself I will use the tools I am familiar with, i.e. NumPy and Matplotlib.

Yes, it was. :)

@JohannesGaessler
Collaborator Author

Alright, if you're the one to contribute a solution for plotting t/s data that is also fine with me.

@CISC
Collaborator

CISC commented May 10, 2025

> For now I'll extend the code in compare-llama-bench.py to support [...] CSV and JSON inputs [...].

@JohannesGaessler Are you still working on this?

@JohannesGaessler
Collaborator Author

I only worked on it for a few minutes before you said that you would work on plotting code; I stashed the changes but as of right now I'm working on other things.

@CISC
Collaborator

CISC commented May 10, 2025

> I only worked on it for a few minutes before you said that you would work on plotting code; I stashed the changes but as of right now I'm working on other things.

Ok, mind if I look into adding JSONL support as that does not require multiple files?

@JohannesGaessler
Collaborator Author

Sure, go ahead. I'll be available for questions and code review.
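For reference, the appeal of JSONL here is that each result is a single self-contained JSON object per line, so results from many runs can be appended to one file and read back with a few lines of Python (the file name below is just a placeholder):

```python
import json

# Minimal sketch of reading JSONL benchmark output: one JSON object per line.
results = []
with open("llama-bench.jsonl") as f:
    for line in f:
        if line.strip():
            results.append(json.loads(line))
```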
