[pull] master from ggerganov:master #216

pull · 2025-02-02T11:45:16Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.1)

* add glm edge chat model
* use config partial_rotary_factor as rope ratio
* support for glm edge model
* vision model support
* remove debug info
* fix format
* llava.cpp trailing whitespace
* remove unused AutoTokenizer
* Update src/llama.cpp for not contain <|end|> or </s>

  Co-authored-by: Xuan Son Nguyen <[email protected]>
* add edge template
* fix chat template
* fix conflict
* fix conflict
* fix ci err
* fix format err
* fix template err
* 9b hf chat support
* format
* format clip.cpp
* fix format
* Apply suggestions from code review
* Apply suggestions from code review
* Update examples/llava/clip.cpp
* fix format
* minor : style

---------

Co-authored-by: liyuhang <[email protected]>
Co-authored-by: piDack <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: liyuhang <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

* initial porting of previous LLG patch
* update for new APIs
* build: integrate llguidance as an external project
* use '%llguidance' as marker to enable llg lark syntax
* add some docs
* clarify docs
* code style fixes
* remove llguidance.h from .gitignore
* fix tests when llg is enabled
* pass vocab not model to llama_sampler_init_llg()
* copy test-grammar-integration.cpp to test-llguidance.cpp
* clang fmt
* fix ref-count bug
* build and run test
* gbnf -> lark syntax
* conditionally include llguidance test based on LLAMA_LLGUIDANCE flag
* rename llguidance test file to test-grammar-llguidance.cpp
* add gh action for llg test
* align tests with LLG grammar syntax and JSON Schema spec
* llama_tokenizer() in fact requires valid utf8
* update llg
* format file
* add $LLGUIDANCE_LOG_LEVEL support
* fix whitespace
* fix warning
* include <cmath> for INFINITY
* add final newline
* fail llama_sampler_init_llg() at runtime
* Link gbnf_to_lark.py script; fix links; refer to llg docs for lexemes
* simplify #includes
* improve doc string for LLAMA_LLGUIDANCE
* typo in merge
* bump llguidance to 0.6.12

…I) (#11585)

* `tool-call`: support Command R7B (w/ tool_plan return)
* `tool-call`: cleaner preservation of tokens + warn when likely bad chat template override
* `tool-call`: test cleanup / handle lazy grammar triggers

This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server code. The motivation is that, in a debug build, the server would crash when an exception was thrown: the exception was unhandled and terminated the server process. When CPPHTTPLIB_NO_EXCEPTIONS is set, cpp_httplib will not call the exception handler, which would normally return a 500 error to the client. This caused tests to fail when using a debug build.

Fixes: #11613
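The role of such an exception handler can be illustrated with a minimal sketch. It does not use cpp-httplib at all: `response` and `dispatch` are hypothetical names introduced here, and the only point is how a top-level try/catch can turn an escaped exception into a 500 response instead of letting it terminate the process.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical response type for illustration; not part of cpp-httplib.
struct response {
    int         status = 200;
    std::string body;
};

// Runs a request handler and converts any escaped exception into a 500 response.
response dispatch(const std::function<void(response &)> & handler) {
    response res;
    try {
        handler(res);
    } catch (const std::exception & e) {
        res.status = 500;          // report the failure to the client
        res.body   = e.what();
    } catch (...) {
        res.status = 500;          // non-standard exception types
        res.body   = "unknown error";
    }
    return res;
}
```

Without a wrapper of this kind, an exception thrown inside the handler would propagate out of the request loop and end the process, which matches the crash described above.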

…chatml upon parsing issue, avoid double bos (#11616)

* tool-call: allow `--jinja --chat-template chatml`
* fix double bos issue (drop bos/eos tokens from jinja template)
* add missing try catch around jinja parsing to default to chatml
* Simplify default chatml logic

This makes git an optional dependency, which is useful when ggml is built not from git but from a tarball or a distribution source package. This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be using it, though, so there doesn't seem to be much value in factoring it out, or even requiring it.

* common : add default embeddings presets

This commit adds default embeddings presets for the following models:

- bge-small-en-v1.5
- e5-small-v2
- gte-small

These can be used with llama-embedding and llama-server.

For example, with llama-embedding:
```console
./build/bin/llama-embedding --embd-gte-small-default -p "Hello, how are you?"
```

And with llama-server:
```console
./build/bin/llama-server --embd-gte-small-default
```

And the embeddings endpoint can then be called with a POST request:
```console
curl --request POST \
    --url http://localhost:8080/embeddings \
    --header "Content-Type: application/json" \
    --data '{"input": "Hello, how are you?"}'
```

I'm not sure if these are the most common embedding models but hopefully this can be a good starting point for discussion and further improvements.

Refs: #10932

…11727) The C API in llama.h claims users can implement `llama_sampler_i` to create custom `llama_sampler`. The sampler chain takes ownership and calls `llama_sampler_free` on them. However, `llama_sampler_free` is hard-coded to use `delete`. This is undefined behavior if the object wasn't also allocated via `new` from libllama's C++ runtime. Callers in C and C-compatible languages do not use C++'s `new` operator. C++ callers may not be sharing the same heap as libllama.
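One common way to avoid this kind of mismatch is to let the library allocate the wrapper object itself and have the caller supply only callbacks and an opaque context. The sketch below illustrates that pattern under stated assumptions: `sampler`, `sampler_iface`, `sampler_init`, and `sampler_free` are hypothetical stand-ins, not the real llama.h declarations, and the sketch does not claim to show how the commit itself resolves the issue.

```cpp
#include <cstdlib>

// Hypothetical stand-ins for a C-style plugin interface; these are NOT the real llama.h types.
struct sampler;
struct sampler_iface {
    void (*free_ctx)(sampler * s);   // user callback that releases the user-owned context
};
struct sampler {
    const sampler_iface * iface;
    void *                ctx;
};

// Library side: the wrapper object is allocated by the library itself ...
sampler * sampler_init(const sampler_iface * iface, void * ctx) {
    return new sampler{iface, ctx};
}

// ... so the `delete` below always pairs with a `new` from the same runtime.
void sampler_free(sampler * s) {
    if (s == nullptr) return;
    if (s->iface->free_ctx != nullptr) s->iface->free_ctx(s);
    delete s;
}

// Caller side: the context is owned by the caller and released with the caller's allocator.
static int g_ctx_released = 0;

static void counter_free_ctx(sampler * s) {
    std::free(s->ctx);
    s->ctx = nullptr;
    ++g_ctx_released;
}

static const sampler_iface counter_iface = { counter_free_ctx };

// Creates and destroys one sampler; returns how many times the callback ran.
int demo_sampler_lifetime() {
    const int before = g_ctx_released;
    sampler * s = sampler_init(&counter_iface, std::malloc(sizeof(int)));
    sampler_free(s);
    return g_ctx_released - before;
}
```

With this split, a caller written in C never has to produce an object that the library will later pass to `delete`; it only hands over function pointers and a `void *`.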

Silently insert U+FFFD(s) (Unicode replacement character) instead until the next valid codepoint can be found. This fixes `llama_tokenize` throwing an exception across the C API boundary or libllama's module boundary (the caller's runtime might be incompatible!). Returning a proper error code might be desirable; however, the signature of `llama_tokenize` doesn't allow it, as all return values already have existing meaning.
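The replacement strategy can be sketched as a standalone function. This is an illustration only, not the code from the commit: `utf8_valid_len` and `sanitize_utf8` are hypothetical names, and emitting one U+FFFD per rejected byte is an assumed policy (other replacement granularities are possible).

```cpp
#include <cstddef>
#include <string>

// Returns the length (1-4) of the valid UTF-8 sequence starting at s[i], or 0 if it is invalid.
static std::size_t utf8_valid_len(const std::string & s, std::size_t i) {
    const auto byte_at = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char c = byte_at(i);
    std::size_t   len = 0;
    unsigned long cp  = 0;
    if      (c < 0x80)           { return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
    else                         { return 0; }            // stray continuation or invalid lead byte
    if (i + len > s.size()) return 0;                     // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        if ((byte_at(i + k) & 0xC0) != 0x80) return 0;    // not a continuation byte
        cp = (cp << 6) | (byte_at(i + k) & 0x3F);
    }
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[len])             return 0;           // overlong encoding
    if (cp > 0x10FFFF)                return 0;           // beyond the Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;           // UTF-16 surrogate
    return len;
}

// Copies valid sequences unchanged and replaces every invalid byte with U+FFFD.
std::string sanitize_utf8(const std::string & in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        const std::size_t len = utf8_valid_len(in, i);
        if (len == 0) {
            out += "\xEF\xBF\xBD";   // UTF-8 encoding of U+FFFD
            ++i;                     // resynchronize at the next byte
        } else {
            out.append(in, i, len);
            i += len;
        }
    }
    return out;
}
```

Because the function always returns a string and never throws on malformed input, nothing has to cross a C boundary as an exception.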

After the barrier in the last iteration has executed, the loop termination condition is still evaluated. However, the main thread may already have destroyed the cgraph object and its nodes, so another thread would then access memory that is already gone. Trouble can also occur when n_nodes == 0 or abort is called, but I'm not sure whether the former situation is possible. The last synchronization should be done after the loop to ensure the cgraph/cplan won't be accessed after the main thread exits from the function.
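The ordering problem can be reproduced in a small, self-contained sketch. Everything here is hypothetical (`graph`, `simple_barrier`, `compute`, and `run_graph` are illustrative names, not ggml code); it only shows the idea of placing the final synchronization after the loop, so that every thread has evaluated the loop condition for the last time before the owner frees the graph.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Minimal reusable barrier (C++11); hypothetical helper for this sketch.
class simple_barrier {
public:
    explicit simple_barrier(int n) : n_(n) {}
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        const unsigned gen = gen_;
        if (++count_ == n_) {            // last arrival releases everyone
            count_ = 0;
            ++gen_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return gen_ != gen; });
        }
    }
private:
    std::mutex              m_;
    std::condition_variable cv_;
    int                     n_;
    int                     count_ = 0;
    unsigned                gen_   = 0;
};

struct graph { int n_nodes; };           // stand-in for a compute graph

void compute(const graph * g, simple_barrier & b, std::atomic<int> & done) {
    for (int i = 0; i < g->n_nodes; ++i) {   // the condition re-reads *g on every pass
        done.fetch_add(1);                   // "process" node i
        if (i + 1 < g->n_nodes) b.wait();    // sync between nodes, but not after the last one
    }
    b.wait();                                // final sync: no thread reads *g past this point
}

int run_graph(int n_threads, int n_nodes) {
    graph * g = new graph{n_nodes};
    simple_barrier   b(n_threads);
    std::atomic<int> done{0};
    std::vector<std::thread> workers;
    for (int t = 1; t < n_threads; ++t) {
        workers.emplace_back(compute, g, std::ref(b), std::ref(done));
    }
    compute(g, b, done);                     // the calling thread takes part as well
    delete g;                                // safe: every worker is past its last read of *g
    for (auto & w : workers) w.join();
    return done.load();
}
```

If the last barrier sat inside the loop instead, a worker released by it would still evaluate `i < g->n_nodes` once more, possibly after the calling thread had already reached `delete g`. The sketch also covers the `n_nodes == 0` case, since the final barrier runs even when the loop body never does.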

)

* redo Settings modal UI
* add python code interpreter
* fix auto scroll
* build
* fix overflow for long output lines
* bring back sticky copy button
* adapt layout on mobile view
* fix multiple lines output and color scheme
* handle python exception
* better state management
* add webworker
* add headers
* format code
* speed up by loading pyodide on page load
* (small tweak) add small animation to make it feel like claude

Technically, the fixed-width types come only from the iostream and cstdint/stdint.h headers. The memory and vector headers should not provide them. In GCC 15 the headers are cleaned up, so the proper header, cstdint, is required.

    src/llama-mmap.h:26:5: error: ‘uint32_t’ does not name a type
       26 |     uint32_t read_u32() const;
          |     ^~~~~~~~
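A minimal sketch of the fix pattern: include `<cstdint>` directly in any header that names a fixed-width type, rather than relying on another standard header to pull it in. The `byte_reader` type below is hypothetical and is not the class from llama-mmap.h; only the explicit include is the point.

```cpp
#include <cstdint>   // declares std::uint32_t explicitly instead of relying on transitive includes

// Hypothetical reader type for illustration.
struct byte_reader {
    const unsigned char * data;

    // Assembles a little-endian 32-bit value from four bytes.
    std::uint32_t read_u32() const {
        return  static_cast<std::uint32_t>(data[0])
             | (static_cast<std::uint32_t>(data[1]) << 8)
             | (static_cast<std::uint32_t>(data[2]) << 16)
             | (static_cast<std::uint32_t>(data[3]) << 24);
    }
};
```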

* server : use common_token_to_piece instead of common_detokenize

  This commit replaces the call to common_detokenize with common_token_to_piece in populate_token_probs. The motivation for this change is to avoid an issue where common_detokenize would remove the word boundary character for tokens, which caused a regression in the token probabilities generated by the server.

  Resolves: #11728

* squash! server : use common_token_to_piece instead of common_detokenize

  Use common_token_to_piece for post_sampling_probs as well.

piDack and others added 4 commits February 2, 2025 09:48

Fix exotic ci env that lacks ostringstream::str (#11581)

6980448

pull bot added the ⤵️ pull label Feb 2, 2025

github-actions bot added the documentation, testing, devops, python, examples, build, and server labels Feb 2, 2025

ericcurtin and others added 2 commits February 2, 2025 15:14

Name colors (#11573)

84ec8a5

It's more descriptive, use #define's so we can use compile-time concatenations. Signed-off-by: Eric Curtin <[email protected]>

CUDA: use mma PTX instructions for FlashAttention (#11583)

864a0b6

* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP

Co-authored-by: Diego Devesa <[email protected]>

github-actions bot added the ggml and Nvidia GPU labels Feb 2, 2025

ochafik and others added 7 commits February 2, 2025 19:58

nit: more informative crash when grammar sampler fails (#11593)

90f9b88

HIP: add GGML_CUDA_CC_IS_* for amd families as increasing cc architectures for amd gpus are not supersets of each other (#11601)

4d0598e

This fixes a bug where RDNA1 gpus other than gfx1010 were not handled correctly

CUDA/HIP: add support for selectable warp size to mmv (#11519)

396856b

CUDA/HIP: add support for selectable warp size to mmv

HIP: fix flash_attn_stream_k_fixup warning (#11604)

6eecde3

server : (webui) Fix Shift+Enter handling (#11609)

d92cb67

* Fix Shift+Enter handling

  `exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway

* build index.html.gz

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>

CUDA: fix Volta FlashAttention logic (#11615)

21c84b5

sync : ggml

8ec0583

github-actions bot added the script label Feb 3, 2025

danbev and others added 6 commits February 3, 2025 16:45

server : (webui) allow typing and submitting during llm response (#11626)

1d1e6a9

server : (webui) revert hacky solution from #11626 (#11634)

b345178

ci : do not stale-close roadmap issues

b34aedd

MQ-mengqing and others added 30 commits February 7, 2025 09:38

ggml : optimize and build warning fix for LoongArch (#11709)

225bbbf

* ggml : optimize convert f32<->f16 for loongarch_asx
* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16
* ggml : Fix warnings when run cpu CI locally on LoongArch

SYCL: remove XMX info from print devices (#11712)

ec3bc82

vulkan: print shared memory size (#11719)

c026ba3

llama : fix progress dots (#11730)

333820d

* Update llama.cpp

  Display progress dots in the terminal. Without this change, progress dots were not displayed while loading a model from a file.

* Update llama.cpp

  removed trailing spaces

llama : fix defrag logic (#11707)

ed926d8

* llama : fix defrag logic

  ggml-ci

* cont : better logic

  ggml-ci

* cont : clamp fragmentation to 0.0

  ggml-ci

Make logging more verbose (#11714)

d2fe216

Debugged an issue with a user who was on a read-only filesystem. Signed-off-by: Eric Curtin <[email protected]>

server : (webui) fix numeric settings being saved as string (#11739)

0cf8671

* server : (webui) fix numeric settings being saved as string
* add some more comments

readme : update front-end framework (#11753)

3ab410f

After the migration to React with #11688

CUDA: fix min. version for movmatrix (#11751)

d80be89

cont : fix mmap flag print (#11699)

bdcf8b6

server : minor log updates (#11760)

aaa5505

ggml-ci

server : (webui) increase edit textarea size (#11763)

e6e6583

vulkan: account for lookup tables when checking shared memory size (#11502)

98f6b0f

There's a better way of clearing lines (#11756)

19d3c82

Use the ANSI escape code for clearing a line. Signed-off-by: Eric Curtin <[email protected]>
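As a small illustration of the technique, the sketch below builds a status line that first clears the current terminal line. It relies on the standard ANSI "Erase in Line" sequence (`ESC [ 2 K` erases the whole line); the helper names are hypothetical, and the exact sequence used by the commit is not shown here.

```cpp
#include <cstdio>
#include <string>

// "\r" moves the cursor to column 0; "\033[2K" (CSI 2 K) erases the entire current line.
inline std::string clear_line() {
    return "\r\033[2K";
}

// Builds a string that replaces whatever is currently on the line with `text`.
inline std::string status_line(const std::string & text) {
    return clear_line() + text;
}

// Redraws a one-line progress indicator on stderr (hypothetical helper).
void print_progress(int percent) {
    std::fputs(status_line("progress: " + std::to_string(percent) + "%").c_str(), stderr);
    std::fflush(stderr);
}
```

Compared with overwriting the line with spaces, the escape sequence does not depend on knowing the terminal width or the length of the previous output.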

vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (#11592)

b044a0f

vulkan: Make Vulkan optional at runtime (#11493). (#11494)

c2a67ef

Co-authored-by: Jeff Bolz <[email protected]>

Update README.md [no ci] (#11781)

9ac3457

typo: `\` -> `/`

Change the UNIX path separator to `/`.

sync: minja (google/minja@a72057e) (#11774)

d7b31a9

server : correct signal handler (#11795)

0893e01

server : (webui) introduce conversation branching + idb storage (#11792)

507f917

* server : (webui) introduce conversation branching + idb storage
* mark old conv as "migrated" instead deleting them
* improve migration
* add more comments
* more clarification

docs: utilize the forward slash (/) as the path separator for Unix-like systems (#11770)

8173261

fix: typos in documentation files (#11791)

7b891bd

* Update ggml.c
* Update arg.cpp
* Update speculative.h

CUDA: use arch list for compatibility check (#11775)

b9ab0a4

* CUDA: use arch list for feature availability check

---------

Co-authored-by: Diego Devesa <[email protected]>
