Releases · syther-labs/llama.cpp

02 Feb 00:27

53debe6

b4611 Latest


ci: use sccache on windows HIP jobs (#11553)

Assets 23

| Asset | Size | Timestamp |
| --- | --- | --- |
| cudart-llama-bin-win-cu11.7-x64.zip | 303 MB | 2025-02-02T00:27:39Z |
| cudart-llama-bin-win-cu12.4-x64.zip | 373 MB | 2025-02-02T00:27:47Z |
| llama-b4611-bin-macos-arm64.zip | 25 MB | 2025-02-02T00:27:56Z |
| llama-b4611-bin-macos-x64.zip | 26.7 MB | 2025-02-02T00:27:57Z |
| llama-b4611-bin-ubuntu-x64.zip | 28.8 MB | 2025-02-02T00:27:58Z |
| llama-b4611-bin-win-avx-x64.zip | 15.3 MB | 2025-02-02T00:27:59Z |
| llama-b4611-bin-win-avx2-x64.zip | 15.3 MB | 2025-02-02T00:28:00Z |
| llama-b4611-bin-win-avx512-x64.zip | 15.3 MB | 2025-02-02T00:28:01Z |
| llama-b4611-bin-win-cuda-cu11.7-x64.zip | 153 MB | 2025-02-02T00:28:02Z |
| llama-b4611-bin-win-cuda-cu12.4-x64.zip | 153 MB | 2025-02-02T00:28:06Z |
| Source code (zip) | | 2025-02-01T18:22:38Z |
| Source code (tar.gz) | | 2025-02-01T18:22:38Z |

01 Feb 12:34

github-actions

b4609

ecef206

b4609

Implement s3:// protocol (#11511)

For those that want to pull from s3

Signed-off-by: Eric Curtin <[email protected]>

Assets 23

31 Jan 18:37

github-actions

b4607

aa6fb13

b4607

`ci`: use sccache on windows instead of ccache (#11545)

* Use sccache on ci for windows

* Detect sccache in cmake

Assets 23

31 Jan 12:35

github-actions

b4604

5783575

b4604

Fix chatml fallback for unsupported builtin templates (when --jinja n…

Assets 22

31 Jan 06:18

github-actions

b4601

a2df278

b4601

server : update help metrics processing/deferred (#11512)

This commit updates the help text for the metrics `requests_processing`
and `requests_deferred` to be more grammatically correct.

Currently the returned metrics look like this:
```console
# HELP llamacpp:requests_processing Number of request processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of request deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```

With this commit, the metrics will look like this:
```console
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
This is also consistent with the description of the metrics in the
server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).
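
For readers consuming these metrics, the following is a minimal sketch of how the text shown above could be parsed in Python. It is not part of the commit: `SAMPLE` and `parse_metrics` are hypothetical names introduced here, and the parser assumes only the three line kinds that appear in the output above (`# HELP`, `# TYPE`, and an unlabeled `name value` sample). It does not handle labels, timestamps, or other features of the Prometheus text format.

```python
# Illustrative only: SAMPLE copies the updated output shown above, and
# parse_metrics is a hypothetical helper, not part of llama.cpp.
SAMPLE = """\
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
"""


def parse_metrics(text):
    """Parse HELP, TYPE, and unlabeled sample lines into a dict per metric."""
    metrics = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# HELP "):
            # "# HELP <name> <free-form help text>"
            name, _, help_text = line[len("# HELP "):].partition(" ")
            metrics.setdefault(name, {})["help"] = help_text
        elif line.startswith("# TYPE "):
            # "# TYPE <name> <metric type>"
            name, _, kind = line[len("# TYPE "):].partition(" ")
            metrics.setdefault(name, {})["type"] = kind
        elif line.startswith("#"):
            # Any other comment line is ignored.
            continue
        else:
            # "<name> <value>" (assumes no labels and no timestamp)
            name, _, value = line.partition(" ")
            metrics.setdefault(name, {})["value"] = float(value)
    return metrics


if __name__ == "__main__":
    for name, info in parse_metrics(SAMPLE).items():
        print(name, info["type"], info["value"], "-", info["help"])
```

Because the commit changes only the `# HELP` text, a consumer that reads just the metric names and values, as this sketch does for `value`, would see identical results before and after the change.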

Assets 22

31 Jan 00:40

github-actions

b4600

553f1e4

b4600

`ci`: ccache for all github workflows (#11516)

Assets 22

30 Jan 18:38

github-actions

b4598

27d135c

b4598

HIP: require at least HIP 5.5

Assets 23

30 Jan 12:40

github-actions

b4595

3d804de

b4595

sync: minja (#11499)

Assets 23

30 Jan 00:32

github-actions

b4589

eb7cf15

b4589

server : add /apply-template endpoint for additional use cases of Min…

Assets 23

29 Jan 18:39

github-actions

b4588

66ee4f2

b4588

vulkan: implement initial support for IQ2 and IQ3 quantizations (#11360)

* vulkan: initial support for IQ3_S

* vulkan: initial support for IQ3_XXS

* vulkan: initial support for IQ2_XXS

* vulkan: initial support for IQ2_XS

* vulkan: optimize Q3_K by removing branches

* vulkan: implement dequantize variants for coopmat2

* vulkan: initial support for IQ2_S

* vulkan: vertically realign code

* port failing dequant callbacks from mul_mm

* Fix array length mismatches

* vulkan: avoid using workgroup size before it is referenced

* tests: increase timeout for Vulkan llvmpipe backend

---------

Co-authored-by: Jeff Bolz <[email protected]>

Assets 23
