
Conversation

@bluebread

This PR enhances the window partition (ggml_win_part, ggml_win_unpart) and relative position embedding (ggml_get_rel_pos) operations in the CPU/CUDA backends. They are essential for SAM and DeepSeek-OCR (#16676).

Changes

  • Add batching support to the operations
  • Extend data type support to F16/BF16 (previously limited to F32)
  • Implement CUDA support
  • Add scaling support to get_rel_pos for handling different query/key lengths
  • Add tests in test-backend-ops.cpp
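For context on the last point: in the SAM formulation, when query and key lengths differ, the relative-position coordinates are rescaled so the resulting index still addresses the same embedding table. A minimal sketch of that index arithmetic follows; `rel_pos_index` is a hypothetical helper name for illustration, not the function this PR adds.

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of SAM-style relative-position indexing for q_size != k_size.
// Each (query, key) coordinate pair maps into a table of learned
// embeddings; when the lengths differ, the coordinates are rescaled by
// max(other/this, 1) so the index range still fits the table.
// Hypothetical helper -- not the kernel added by this PR.
int64_t rel_pos_index(int64_t q, int64_t k, int64_t q_size, int64_t k_size) {
    double q_scale = std::max((double) k_size / (double) q_size, 1.0);
    double k_scale = std::max((double) q_size / (double) k_size, 1.0);
    double idx = (double) q * q_scale
               - (double) k * k_scale
               + (double) (k_size - 1) * k_scale;
    return (int64_t) idx;
}
```

When q_size == k_size both scales are 1 and the index reduces to the familiar q - k + (k_size - 1).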

@github-actions github-actions bot added the testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 19, 2025
@bluebread bluebread marked this pull request as draft November 19, 2025 14:56
@bluebread bluebread marked this pull request as ready for review November 19, 2025 15:01
@bluebread
Author

I opened this PR to avoid making the final DeepSeek-OCR implementation PR too large to review. I'm still new to this project, so please let me know if this approach doesn't align with the project's workflow.

@Acly
Collaborator

Acly commented Nov 19, 2025

I'm a bit sceptical about extending these operations. They are rather specific to SAM, and can be replaced with a combination of view/permute/cont. While a "native" implementation might be a bit faster, since there is one less intermediate result that has to be written to memory, in my experience this is not noticeable on GPU. The window partitioning is dwarfed by the actual attention (and mul_mat/conv2d) that is usually going on.

It would be interesting to give it a try for DeepSeek-OCR. (Is there a PR for model implementation already?)

For reference:

ggml_tensor* window_partition(ggml_context* m, ggml_tensor* x, int window) {
    int64_t c = x->ne[0], w = x->ne[1], h = x->ne[2], b = x->ne[3];
    // same as
    // x = ggml_win_part(m, x, window);
    // x = ggml_reshape_3d(m, x, c, window * window, x->ne[3]);

    int64_t px = (window - w % window) % window;
    int64_t py = (window - h % window) % window;
    int64_t npw = (w + px) / window;
    int64_t nph = (h + py) / window;

    if (px > 0 || py > 0) {
        x = ggml_pad(m, x, 0, int(px), int(py), 0);
    }
    x = ggml_reshape_4d(m, x, c * window, npw, window, nph * b);
    x = ggml_cont(m, ggml_permute(m, x, 0, 2, 1, 3));
    x = ggml_reshape_3d(m, x, c, window * window, npw * nph * b);
    return x;
}

ggml_tensor* window_reverse(ggml_context* m, ggml_tensor* x, int w, int h, int window) {
    int64_t c = x->ne[0];
    int64_t b = x->ne[3];
    // same as
    // x = ggml_reshape_4d(m, x, c, window, window, x->ne[2]);
    // x = ggml_win_unpart(m, x, w, h, window);

    int64_t px = (window - w % window) % window;
    int64_t py = (window - h % window) % window;
    int64_t npw = (w + px) / window;
    int64_t nph = (h + py) / window;

    x = ggml_reshape_4d(m, x, c * window, window, npw, nph * b);
    x = ggml_cont(m, ggml_permute(m, x, 0, 2, 1, 3));
    x = ggml_reshape_4d(m, x, c, w + px, h + py, b);
    x = ggml_view_4d(m, x, x->ne[0], w, h, x->ne[3], x->nb[1], x->nb[2], x->nb[3], 0);
    x = ggml_cont(m, x);
    return x;
}
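The padding arithmetic shared by the two helpers above can be isolated into a small self-contained sketch (the struct and function names here are hypothetical, for illustration only): pad (w, h) up to multiples of `window`, then count the windows per axis.

```cpp
#include <cstdint>

// Padding and window-count arithmetic as used by window_partition and
// window_reverse: right/bottom pad amounts, then windows per axis.
// Hypothetical helper struct, for illustration.
struct win_dims {
    int64_t px, py;   // padding added on the right/bottom
    int64_t npw, nph; // number of windows along width/height
};

win_dims compute_win_dims(int64_t w, int64_t h, int window) {
    win_dims d;
    d.px  = (window - w % window) % window;
    d.py  = (window - h % window) % window;
    d.npw = (w + d.px) / window;
    d.nph = (h + d.py) / window;
    return d;
}
```

For example, a 32x20 image with window = 14 pads to 42x28 and partitions into a 3x2 grid of windows; dimensions already divisible by `window` get zero padding.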

@bluebread
Author

@Acly Thanks, I really appreciate the suggestion. I hadn't thought of that approach. We haven't opened a PR for DeepSeek-OCR yet and are still working on the feature in our own repository. Should we just open one?

@am17an
Collaborator

am17an commented Nov 20, 2025

@bluebread yes, you should. If it's not ready, you can open it as a draft PR. If you don't introduce any new ggml ops, it will be faster to merge; if you must, then typically you push the baseline (CPU) version first.

@sfallah
Contributor

sfallah commented Nov 20, 2025

@am17an
@bluebread
you can find the draft PR here: #17400

FYI: it is still work in progress.

@sfallah
Contributor

sfallah commented Nov 20, 2025

@Acly
thank you for the reference.
The SAM model is working, thanks to your code, with minor changes:
https://github.com/sfallah/llama.cpp/blob/sf/deepseek-ocr/tools/mtmd/clip.cpp#L2469

