
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17365

Add a dedicated top-k op so that backend implementations can optimize it more efficiently. The old argsort-based implementation is renamed to ggml_argsort_top_k.
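
For context, a minimal sketch of how a caller might build a graph with the new op. This is illustrative only; `ggml_top_k(ctx, a, k)` is the existing public API, while the `ggml_argsort_top_k` signature is assumed here to mirror it.

```cpp
// Illustrative usage sketch (not taken from the PR diff): select the indices of
// the k largest logits per row with the new dedicated op.
#include "ggml.h"

struct ggml_tensor * build_top_k(struct ggml_context * ctx,
                                 struct ggml_tensor * logits,  // e.g. [n_vocab, n_tokens]
                                 int k) {
    // new dedicated op: per this PR, its output is not guaranteed to be sorted,
    // which gives backends more freedom to optimize the selection
    struct ggml_tensor * idx = ggml_top_k(ctx, logits, k);

    // the old sorted, argsort-based behavior remains available under the new name
    // (signature assumed to match ggml_top_k):
    // struct ggml_tensor * idx_sorted = ggml_argsort_top_k(ctx, logits, k);

    return idx;
}
```
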

TODO:

Next PRs:

  • Vulkan
  • etc.

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #310 - GGML Top-K Operation

Overview

PR #310 introduces a dedicated GGML_OP_TOP_K operation replacing the previous argsort + view implementation. The change modifies 15 files with 511 additions and 80 deletions, implementing a new operation type with optimized CPU and Metal backend support.

Key Findings

Performance-Critical Functions Impact

ggml_top_k (libggml-base.so) shows substantial improvement with response time reduced by 2337 ns (from 4458 ns to 2121 ns) and throughput improved by 25 ns (from 86 ns to 61 ns). The implementation replaces full std::sort with std::partial_sort, changing algorithmic complexity from O(n log n) to O(n log k). For typical use cases with n=16384 and k=40, this reduces comparison operations from approximately 229,376 to 87,654.
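
As an illustration of the algorithmic change (not the backend code itself), selecting the top-k indices of a row with std::partial_sort instead of a full std::sort looks roughly like this; the function name and signature are hypothetical:

```cpp
// Sketch: top-k index selection over a row of n floats.
// std::sort over the whole row is O(n log n); std::partial_sort only orders the
// first k positions and is O(n log k), which is the reduction described above.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

std::vector<int32_t> top_k_indices(const float * row, int n, int k) {
    std::vector<int32_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);            // 0, 1, ..., n-1

    // old path (conceptually): std::sort(idx.begin(), idx.end(), cmp);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [row](int32_t a, int32_t b) { return row[a] > row[b]; });

    idx.resize(k);                                   // keep only the top-k indices
    return idx;
}
```
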

quantize_row_tq1_0 (libggml-cpu.so) demonstrates improvement with response time reduced by 18 ns (from 102 ns to 84 ns) and throughput reduced by 19 ns (from 88 ns to 69 ns), achieved through SIMD vectorization optimizations in the quantization path.

Eight parameter setter/getter functions across multiple backends show throughput increases ranging from 11 ns to 21 ns. These functions (ggml_set_op_params_f32, ggml_set_op_params_i32, ggml_get_op_params_i32) added validation and indirection layers, changing from direct struct access to pointer-based access with null checks and bounds validation. The changes affect libggml-base.so and libggml-cpu.so with consistent patterns across AMX, CPU, SGEMM, and traits implementations.
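
The sketch below shows the general shape of the change the analysis describes, using hypothetical names; it is not the PR's code, only an illustration of direct struct access versus pointer-based access with a null check and bounds validation for an op_params-style array:

```cpp
// Illustrative before/after pattern for a setter over a fixed-size op_params array.
#include <cassert>
#include <cstdint>

struct tensor_like {
    int32_t op_params[16];   // stand-in for ggml's fixed-size op_params storage
};

// before: direct struct access, bounds enforced only by an assert
static inline void set_param_direct(tensor_like & t, uint32_t i, int32_t v) {
    assert(i < 16);
    t.op_params[i] = v;
}

// after (as described above): pointer-based access with explicit checks,
// which costs a few extra nanoseconds per call but guards against bad inputs
static inline bool set_param_checked(tensor_like * t, uint32_t i, int32_t v) {
    if (t == nullptr || i >= 16) {
        return false;
    }
    int32_t * params = t->op_params;
    params[i] = v;
    return true;
}
```
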

Inference Performance Impact

The modified functions operate outside the primary inference path. Functions llama_decode, llama_encode, and llama_tokenize show no changes in response time or throughput. The ggml_top_k optimization affects sampling operations during token generation but not the core decode/encode cycle. Token generation throughput remains unaffected as the top-k operation occurs after logit computation, not within the inference forward pass.

Power Consumption Analysis

Power consumption changes across all binaries remain within ±0.11%. Specific measurements: libggml-base.so decreased by 0.11% (from 71,255 nJ to 71,174 nJ), libggml-cpu.so decreased by 0.08% (from 128,302 nJ to 128,200 nJ). All other binaries (libggml.so, libllama.so, libmtmd.so, llama-bench, llama-cvector-generator, llama-run, llama-tts) show changes of 0.00%, indicating negligible energy impact from the modifications.

Implementation Details

The new ggml_top_k implementation uses std::partial_sort with per-thread buffers offset by (ne00 + CACHE_LINE_SIZE_F32) * ith into the shared work buffer to prevent false sharing. The Metal backend implements block-wise top-k selection with bitonic sort, requiring 2x temporary storage for ping-pong merge buffers. The operation explicitly produces unordered output, with the first two elements swapped to emphasize this semantic difference. MoE expert selection code correctly migrated to ggml_argsort_top_k to preserve its sorted-output requirement.
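
A rough sketch of the CPU-side pattern described above, assuming the common ggml convention of addressing a shared work buffer by (ne00 + CACHE_LINE_SIZE_F32) * ith; the names (wdata, ith, ne00) follow ggml conventions but this is not the PR's actual kernel:

```cpp
// Sketch of per-row top-k on the CPU backend, under the stated assumptions.
#include <algorithm>
#include <cstdint>
#include <utility>

static void top_k_row(const float * src, int32_t * dst, int64_t ne00, int k,
                      int32_t * wdata, int ith, int64_t cache_line_f32) {
    // each thread works in its own padded slice of the shared buffer,
    // with the padding keeping slices on separate cache lines
    int32_t * idx = wdata + (ne00 + cache_line_f32) * ith;
    for (int64_t i = 0; i < ne00; ++i) {
        idx[i] = (int32_t) i;
    }

    // order only the first k positions: O(ne00 log k)
    std::partial_sort(idx, idx + k, idx + ne00,
                      [src](int32_t a, int32_t b) { return src[a] > src[b]; });

    std::copy(idx, idx + k, dst);

    // deliberately break the sorted order so callers never rely on it,
    // mirroring the "unordered output" semantics described above
    if (k > 1) {
        std::swap(dst[0], dst[1]);
    }
}
```
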

@loci-dev loci-dev force-pushed the main branch 22 times, most recently from 17a79cb to a89c6ad on November 27, 2025 12:14
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from e5edfa8 to e8163f9 on December 7, 2025 02:46