
Conversation

@bluebread

This PR enhances the window partition (ggml_win_part, ggml_win_unpart) and relative position embedding (ggml_get_rel_pos) operations in the CPU/CUDA backends. They are essential for SAM and DeepSeek-OCR (#16676).

Changes

  • Add batching support to the operations
  • Extend data type support to F16/BF16 (previously limited to F32)
  • Implement CUDA support
  • Add scaling support to get_rel_pos for handling different query/key lengths
  • Add tests in test-backend-ops.cpp
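For context on the last point: in the SAM formulation, when query and key lengths differ, the relative-position coordinates are rescaled so the resulting index still addresses the same embedding table. A minimal sketch of that index arithmetic follows; `rel_pos_index` is a hypothetical helper name for illustration, not the function this PR adds.

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of SAM-style relative-position indexing for q_size != k_size.
// Each (query, key) coordinate pair maps into a table of learned
// embeddings; when the lengths differ, the coordinates are rescaled by
// max(other/this, 1) so the index range still fits the table.
// Hypothetical helper -- not the kernel added by this PR.
int64_t rel_pos_index(int64_t q, int64_t k, int64_t q_size, int64_t k_size) {
    double q_scale = std::max((double) k_size / (double) q_size, 1.0);
    double k_scale = std::max((double) q_size / (double) k_size, 1.0);
    double idx = (double) q * q_scale
               - (double) k * k_scale
               + (double) (k_size - 1) * k_scale;
    return (int64_t) idx;
}
```

When q_size == k_size both scales are 1 and the index reduces to the familiar q - k + (k_size - 1).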

@github-actions github-actions bot added the testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 19, 2025
@bluebread bluebread marked this pull request as draft November 19, 2025 14:56
@bluebread bluebread marked this pull request as ready for review November 19, 2025 15:01
@bluebread
Author

I opened this PR to avoid making the final DeepSeek-OCR implementation PR too large to review. I'm still new to this project, so please let me know if this approach doesn't align with the project's workflow.

@Acly
Collaborator

Acly commented Nov 19, 2025

I'm a bit sceptical about extending these operations. They are rather specific to SAM, and can be replaced with a combination of view/permute/cont. While a "native" implementation might be a bit faster, since there is one less intermediate result that has to be written to memory, in my experience this is not noticeable on GPU. The window partitioning is dwarfed by the actual attention (and mul_mat/conv2d) that is usually going on.

It would be interesting to give it a try for DeepSeek-OCR. (Is there a PR for model implementation already?)

For reference:

ggml_tensor* window_partition(ggml_context* m, ggml_tensor* x, int window) {
    int64_t c = x->ne[0], w = x->ne[1], h = x->ne[2], b = x->ne[3];
    // same as
    // x = ggml_win_part(m, x, window);
    // x = ggml_reshape_3d(m, x, c, window * window, x->ne[3]);

    int64_t px = (window - w % window) % window;
    int64_t py = (window - h % window) % window;
    int64_t npw = (w + px) / window;
    int64_t nph = (h + py) / window;

    if (px > 0 || py > 0) {
        x = ggml_pad(m, x, 0, int(px), int(py), 0);
    }
    x = ggml_reshape_4d(m, x, c * window, npw, window, nph * b);
    x = ggml_cont(m, ggml_permute(m, x, 0, 2, 1, 3));
    x = ggml_reshape_3d(m, x, c, window * window, npw * nph * b);
    return x;
}

ggml_tensor* window_reverse(ggml_context* m, ggml_tensor* x, int w, int h, int window) {
    int64_t c = x->ne[0];
    int64_t b = x->ne[3];
    // same as
    // x = ggml_reshape_4d(m, x, c, window, window, x->ne[2]);
    // x = ggml_win_unpart(m, x, w, h, window);

    int64_t px = (window - w % window) % window;
    int64_t py = (window - h % window) % window;
    int64_t npw = (w + px) / window;
    int64_t nph = (h + py) / window;

    x = ggml_reshape_4d(m, x, c * window, window, npw, nph * b);
    x = ggml_cont(m, ggml_permute(m, x, 0, 2, 1, 3));
    x = ggml_reshape_4d(m, x, c, w + px, h + py, b);
    x = ggml_view_4d(m, x, x->ne[0], w, h, x->ne[3], x->nb[1], x->nb[2], x->nb[3], 0);
    x = ggml_cont(m, x);
    return x;
}
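The padding arithmetic shared by the two helpers above can be isolated into a small self-contained sketch (the struct and function names here are hypothetical, for illustration only): pad (w, h) up to multiples of `window`, then count the windows per axis.

```cpp
#include <cstdint>

// Padding and window-count arithmetic as used by window_partition and
// window_reverse: right/bottom pad amounts, then windows per axis.
// Hypothetical helper struct, for illustration.
struct win_dims {
    int64_t px, py;   // padding added on the right/bottom
    int64_t npw, nph; // number of windows along width/height
};

win_dims compute_win_dims(int64_t w, int64_t h, int window) {
    win_dims d;
    d.px  = (window - w % window) % window;
    d.py  = (window - h % window) % window;
    d.npw = (w + d.px) / window;
    d.nph = (h + d.py) / window;
    return d;
}
```

For example, a 32x20 image with window = 14 pads to 42x28 and partitions into a 3x2 grid of windows; dimensions already divisible by `window` get zero padding.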

@bluebread
Author

@Acly Thanks, I really appreciate the suggestion. I hadn't thought of that approach. We haven't opened a PR for DeepSeek-OCR yet and are still working on the feature in our own repository. Should we just open one?

@am17an
Collaborator

am17an commented Nov 20, 2025

@bluebread yes, you should. If it's not ready, you can open it as a draft PR. If you don't introduce any new ggml ops, it will be faster to merge; if you must, then typically you push the baseline (CPU) version first.

@sfallah
Contributor

sfallah commented Nov 20, 2025

@am17an
@bluebread
you can find the draft PR here: #17400

FYI: it is still work in progress.

@sfallah
Contributor

sfallah commented Nov 20, 2025

@Acly
thank you for the reference.
The SAM model is working, thanks to your code, with minor changes:
https://github.com/sfallah/llama.cpp/blob/sf/deepseek-ocr/tools/mtmd/clip.cpp#L2469

