Differential mode for llama-bench + plotting code #13408
Comments
The idea looks good to me, but I don't quite understand what …
The effect of …
How about Mermaid with multiple lines (multiple bars do not seem to be supported), or radar?
OK, I understand what the behavior of … would be.
Unless this is you volunteering to do the implementation and maintenance yourself, I will use the tools I am familiar with, i.e. NumPy and Matplotlib.
For now I'll extend the code in … For what I'm doing specifically, the t/s as a function of batch size is usually of interest, because kernels should ideally perform well across all tensor shapes. A flag like …
Yeah, this is not great. I think it should be possible to group all the tests so that they are performed together when possible without too many changes; I may give this a try later.
Yes, it was. :)
Alright, if you're the one to contribute a solution for plotting t/s data, that is also fine with me.
@JohannesGaessler Are you still working on this?
I only worked on it for a few minutes before you said that you would work on plotting code; I stashed the changes, but as of right now I'm working on other things.
OK, mind if I look into adding JSONL support, as that does not require multiple files?
Sure, go ahead. I'll be available for questions and code review.
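For what it's worth, the appeal of JSONL here is that each run appends one JSON object per line to a shared file. A minimal sketch of reading and writing such a file — the `depth` and `avg_ts` field names are made up for illustration, not the actual llama-bench output schema:

```python
import json

def append_record(path: str, record: dict) -> None:
    # One JSON object per line: successive llama-bench runs can append to
    # the same file instead of each producing a separate JSON document.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def load_records(path: str) -> list[dict]:
    # Read the file back into a list of dicts, skipping blank lines.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical usage with made-up fields:
append_record("bench.jsonl", {"depth": 1024, "avg_ts": 123.4})
print(load_records("bench.jsonl"))
```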
I think it would be useful if there was a way to more easily compare the outputs of `llama-bench` as a function of context size and would therefore want to implement such a feature. What I'm imagining is something like a `--differential` flag which, when set, provides separate numbers for each individual model evaluation in a benchmark run instead of one number for all evaluations as a whole. So for example, with `./llama-bench -r 1 -d 1024 -n 4 -p 64 -ub 16 --differential` I'm imagining something like this: …

You could in principle already do something like this by just invoking `llama-bench` multiple times, but it's kind of inconvenient.

Because reading differential data from a table is difficult, I would also be adding code to plot the t/s as a function of the depth using Matplotlib. I would add plotting code to `compare-llama-bench.py`, but because a lot of the time people want just the performance for a single commit, I would also add a simplified plotting script which just reads in one or more CSV tables and plots the contents of all tables in a single figure.

It would in principle also be possible to add code for fitting a polynomial to the runtime as a function of depth, but doing this correctly is non-trivial. It could be done with comparatively low effort by using kafe2, but that project is licensed under the GPLv3. In any case, I think a statistical analysis of the performance differences (and how they would extrapolate to higher context sizes) will not be needed anyway, so it makes more sense to keep it simple and just do a plot.
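A rough sketch of what that simplified plotting script could look like, assuming CSV input with placeholder column names `depth` and `avg_ts` (the real llama-bench headers may differ); the dashed quadratic fit is only an illustrative guide line, not the statistical analysis decided against above:

```python
import sys

import matplotlib.pyplot as plt
import numpy as np

def main() -> None:
    fig, ax = plt.subplots()
    for path in sys.argv[1:]:
        # Assumed CSV layout: a header row, then one row per depth value.
        # "depth" and "avg_ts" are placeholder column names, not necessarily
        # the headers llama-bench actually writes.
        data = np.genfromtxt(path, delimiter=",", names=True)
        ax.plot(data["depth"], data["avg_ts"], marker="o", label=path)
        # Optional: a crude quadratic guide line via np.polyfit.
        coeffs = np.polyfit(data["depth"], data["avg_ts"], deg=2)
        xs = np.linspace(data["depth"].min(), data["depth"].max(), 100)
        ax.plot(xs, np.polyval(coeffs, xs), linestyle="--", alpha=0.5)
    ax.set_xlabel("depth (context size)")
    ax.set_ylabel("t/s")
    ax.legend()
    plt.show()

if __name__ == "__main__":
    main()
```

Reading one or more tables from `sys.argv` keeps the single-commit case trivial while still allowing several tables to be drawn in one figure, matching the use case described above.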
@slaren since you are probably the biggest stakeholder for `llama-bench`, does the feature as described here sound useful to you? Do you have suggestions for changes?