Releases · ggml-org/llama.cpp

10 May 07:42

d891942

b5335 Latest

CUDA: fix FlashAttention on Turing (#13415)

Assets 20

cudart-llama-bin-win-cuda11.7-x64.zip (303 MB, 2025-05-10T07:42:27Z)
cudart-llama-bin-win-cuda12.4-x64.zip (373 MB, 2025-05-10T07:42:38Z)
llama-b5335-bin-macos-arm64.zip (9.85 MB, 2025-05-10T07:42:51Z)
llama-b5335-bin-macos-x64.zip (23.4 MB, 2025-05-10T07:42:52Z)
llama-b5335-bin-ubuntu-arm64.zip (10.5 MB, 2025-05-10T07:42:53Z)
llama-b5335-bin-ubuntu-vulkan-x64.zip (18.6 MB, 2025-05-10T07:42:54Z)
llama-b5335-bin-ubuntu-x64.zip (11 MB, 2025-05-10T07:42:55Z)
llama-b5335-bin-win-cpu-arm64.zip (11.5 MB, 2025-05-10T07:42:56Z)
llama-b5335-bin-win-cpu-x64.zip (12.6 MB, 2025-05-10T07:42:57Z)
llama-b5335-bin-win-cuda11.7-x64.zip (127 MB, 2025-05-10T07:42:58Z)
Source code (zip) (2025-05-10T07:16:52Z)
Source code (tar.gz) (2025-05-10T07:16:52Z)
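
The binary asset names above appear to follow a `llama-<tag>-bin-<platform>-<arch>.zip` pattern. The sketch below shows one way a download script might pick the right archive by name. The pattern is inferred only from the names listed here, and `parse_asset` and `pick_asset` are hypothetical helpers, not part of llama.cpp or the GitHub API.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Assumed naming pattern, inferred from the asset list above only:
#   llama-<tag>-bin-<platform>-<arch>.zip
ASSET_RE = re.compile(r"^llama-(b\d+)-bin-(.+)-(x64|arm64)\.zip$")


class Asset(NamedTuple):
    tag: str       # build tag, e.g. "b5335"
    platform: str  # e.g. "ubuntu-vulkan" or "win-cuda11.7"
    arch: str      # "x64" or "arm64"


def parse_asset(name: str) -> Optional[Asset]:
    """Split an asset file name into its parts, or return None if it does not match."""
    m = ASSET_RE.match(name)
    return Asset(*m.groups()) if m else None


def pick_asset(names: Iterable[str], platform: str, arch: str) -> Optional[str]:
    """Return the first asset name matching the requested platform and architecture."""
    for name in names:
        asset = parse_asset(name)
        if asset and asset.platform == platform and asset.arch == arch:
            return name
    return None
```

Names that do not fit the pattern, such as the `cudart-llama-bin-...` runtime bundles and the source archives, yield `None` from `parse_asset` and are skipped by `pick_asset`.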

10 May 06:56

github-actions

b5334

7fef117

arg : add env var to control mmproj (#13416)

* arg : add env var to control mmproj

* small note about -hf --mmproj

Assets 20

10 May 06:35

github-actions

b5333

dc1d2ad

vulkan: scalar flash attention implementation (#13324)

* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA

Assets 20

09 May 20:44

github-actions

b5332

7c28a74

chore(llguidance): use tagged version that does not break the build (…

Assets 20

09 May 18:20

github-actions

b5331

33eff40

server : vision support via libmtmd (#12898)

* server : (experimental) vision support via libmtmd

* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* abstract out the batch management

* small fix

* refactor logic adding tokens to batch

* implement hashing image

* use FNV hash, now hash bitmap instead of file data

* allow decoding image embedding to be split into batches

* rm whitespace

* disable some features when mtmd is on

* fix --no-mmproj-offload

* mtmd_context_params no timings

* refactor server_inp to server_tokens

* fix the failing test case

* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* improve server_input struct

* clip : fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff

* fix detokenize

* add const to various places

* add warning about breaking changes

* add c api

* helper: use mtmd_image_tokens_get_n_pos

* fix ctx_shift

* fix name shadowing

* more strict condition

* support remote image_url

* remote image_url log

* add CI test

* do not log base64

* add "has_multimodal" to /props

* remove dangling image

* speculative: use slot.cache_tokens.insert

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov

* rm can_be_detokenized

* on prompt processing done, assert cache_tokens.size

* handle_completions_impl returns void

* adapt the new web ui

* update docs and hot topics

* rm assert

* small fix (2)

---------

Co-authored-by: Georgi Gerganov
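
The commit list above mentions hashing the decoded bitmap with an FNV hash rather than hashing the file data. The sketch below illustrates that idea with 64-bit FNV-1a. The commit list does not say which FNV variant or width is used, so both are assumptions here, and `fnv1a_64` and `image_id` are hypothetical names, not the library's API.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64  # keep the state within 64 bits
    return h


def image_id(bitmap: bytes) -> str:
    """Hypothetical helper: derive a stable ID from decoded pixel data.

    Hashing the bitmap (not the encoded file) means two files that decode
    to identical pixels map to the same ID.
    """
    return format(fnv1a_64(bitmap), "016x")
```

One plausible use of such an ID is as a cache key, so an image that was already processed can be recognized when it appears again.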

Assets 20

09 May 16:58

github-actions

b5330

17512a9

sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)

* sycl : Implemented reorder Q4_0 mmvq

Signed-off-by: Alberto Cabrera

* sycl : Fixed mmvq being called when reorder is disabled

* sycl : Improved comments in the quants header

Signed-off-by: Alberto Cabrera

* Use static_assert

* safe_div -> ceil_div

* Clarify qi comment

* change the reorder tensor from init to execute OP

* dbg

* Undo changes to test-backend-ops

* Refactor changes on top of q4_0 reorder fix

* Missing Reverts

* Refactored opt_for_reorder logic to simplify code path

* Explicit inlining and unroll

* Renamed mul_mat_algo enum for consistency

---------

Signed-off-by: Alberto Cabrera
Co-authored-by: romain.biessy
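
The "safe_div -> ceil_div" rename above refers to ceiling division, which kernels commonly use to compute how many fixed-size blocks are needed to cover `n` elements. The sketch below shows the usual integer form. Only the name comes from the commit list; the signature and the non-negative-input restriction are assumptions, not the SYCL backend's actual helper.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for non-negative a and positive b, using integers only."""
    if a < 0 or b <= 0:
        raise ValueError("ceil_div expects a >= 0 and b > 0")
    # Adding b - 1 before floor division rounds any remainder up.
    return (a + b - 1) // b


# Example: number of 32-element blocks needed to cover 100 elements.
num_blocks = ceil_div(100, 32)
```

Staying in integer arithmetic avoids the rounding issues that a floating-point `ceil(a / b)` can show for large values.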

Assets 20

09 May 16:38

github-actions

b5329

611aa91

metal : optimize MoE for large batches (#13388)

ggml-ci

Assets 20

09 May 13:28

github-actions

b5328

0cf6725

CUDA: FA support for Deepseek (Ampere or newer) (#13306)

* CUDA: FA support for Deepseek (Ampere or newer)

* do loop unrolling via C++ template

Assets 20

09 May 12:08

github-actions

b5327

27ebfca

llama : do not crash if there is no CPU backend (#13395)

* llama : do not crash if there is no CPU backend

* add checks to examples

Assets 20

09 May 11:45

github-actions

b5326

5c86c9e

CUDA: fix crash on large batch size for MoE models (#13384)

Assets 20
