
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17365

Add a dedicated top-k op so that backend implementations can optimize it more efficiently. The old argsort-based implementation is renamed to ggml_argsort_top_k.
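
For context, a minimal sketch of how a caller might build a graph with the new op. This is illustrative only; `ggml_top_k(ctx, a, k)` is the existing public API, while the `ggml_argsort_top_k` signature is assumed here to mirror it.

```cpp
// Illustrative usage sketch (not taken from the PR diff): select the indices of
// the k largest logits per row with the new dedicated op.
#include "ggml.h"

struct ggml_tensor * build_top_k(struct ggml_context * ctx,
                                 struct ggml_tensor * logits,  // e.g. [n_vocab, n_tokens]
                                 int k) {
    // new dedicated op: per this PR, its output is not guaranteed to be sorted,
    // which gives backends more freedom to optimize the selection
    struct ggml_tensor * idx = ggml_top_k(ctx, logits, k);

    // the old sorted, argsort-based behavior remains available under the new name
    // (signature assumed to match ggml_top_k):
    // struct ggml_tensor * idx_sorted = ggml_argsort_top_k(ctx, logits, k);

    return idx;
}
```
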

TODO:

Next PRs:

  • Vulkan
  • etc.

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #310 - GGML Top-K Operation

Overview

PR #310 introduces a dedicated GGML_OP_TOP_K operation replacing the previous argsort + view implementation. The change modifies 15 files with 511 additions and 80 deletions, implementing a new operation type with optimized CPU and Metal backend support.

Key Findings

Performance-Critical Functions Impact

ggml_top_k (libggml-base.so) shows substantial improvement with response time reduced by 2337 ns (from 4458 ns to 2121 ns) and throughput improved by 25 ns (from 86 ns to 61 ns). The implementation replaces full std::sort with std::partial_sort, changing algorithmic complexity from O(n log n) to O(n log k). For typical use cases with n=16384 and k=40, this reduces comparison operations from approximately 229,376 to 87,654.
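
As an illustration of the algorithmic change (not the backend code itself), selecting the top-k indices of a row with std::partial_sort instead of a full std::sort looks roughly like this; the function name and signature are hypothetical:

```cpp
// Sketch: top-k index selection over a row of n floats.
// std::sort over the whole row is O(n log n); std::partial_sort only orders the
// first k positions and is O(n log k), which is the reduction described above.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

std::vector<int32_t> top_k_indices(const float * row, int n, int k) {
    std::vector<int32_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);            // 0, 1, ..., n-1

    // old path (conceptually): std::sort(idx.begin(), idx.end(), cmp);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [row](int32_t a, int32_t b) { return row[a] > row[b]; });

    idx.resize(k);                                   // keep only the top-k indices
    return idx;
}
```
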

quantize_row_tq1_0 (libggml-cpu.so) demonstrates improvement with response time reduced by 18 ns (from 102 ns to 84 ns) and throughput reduced by 19 ns (from 88 ns to 69 ns), achieved through SIMD vectorization optimizations in the quantization path.

Eight parameter setter/getter functions across multiple backends show throughput increases ranging from 11 ns to 21 ns. These functions (ggml_set_op_params_f32, ggml_set_op_params_i32, ggml_get_op_params_i32) added validation and indirection layers, changing from direct struct access to pointer-based access with null checks and bounds validation. The changes affect libggml-base.so and libggml-cpu.so with consistent patterns across AMX, CPU, SGEMM, and traits implementations.
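
The sketch below shows the general shape of the change the analysis describes, using hypothetical names; it is not the PR's code, only an illustration of direct struct access versus pointer-based access with a null check and bounds validation for an op_params-style array:

```cpp
// Illustrative before/after pattern for a setter over a fixed-size op_params array.
#include <cassert>
#include <cstdint>

struct tensor_like {
    int32_t op_params[16];   // stand-in for ggml's fixed-size op_params storage
};

// before: direct struct access, bounds enforced only by an assert
static inline void set_param_direct(tensor_like & t, uint32_t i, int32_t v) {
    assert(i < 16);
    t.op_params[i] = v;
}

// after (as described above): pointer-based access with explicit checks,
// which costs a few extra nanoseconds per call but guards against bad inputs
static inline bool set_param_checked(tensor_like * t, uint32_t i, int32_t v) {
    if (t == nullptr || i >= 16) {
        return false;
    }
    int32_t * params = t->op_params;
    params[i] = v;
    return true;
}
```
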

Inference Performance Impact

The modified functions operate outside the primary inference path. Functions llama_decode, llama_encode, and llama_tokenize show no changes in response time or throughput. The ggml_top_k optimization affects sampling operations during token generation but not the core decode/encode cycle. Token generation throughput remains unaffected as the top-k operation occurs after logit computation, not within the inference forward pass.

Power Consumption Analysis

Power consumption changes across all binaries remain within ±0.11%. Specific measurements: libggml-base.so decreased by 0.11% (from 71,255 nJ to 71,174 nJ), libggml-cpu.so decreased by 0.08% (from 128,302 nJ to 128,200 nJ). All other binaries (libggml.so, libllama.so, libmtmd.so, llama-bench, llama-cvector-generator, llama-run, llama-tts) show changes of 0.00%, indicating negligible energy impact from the modifications.

Implementation Details

The new ggml_top_k implementation uses std::partial_sort with per-thread buffers offset by (ne00 + CACHE_LINE_SIZE_F32) * ith into the shared work buffer to prevent false sharing. The Metal backend implements block-wise top-k selection with bitonic sort, requiring 2x temporary storage for ping-pong merge buffers. The operation explicitly produces unordered output, with the first two elements swapped to emphasize this semantic difference. MoE expert selection code correctly migrated to ggml_argsort_top_k to preserve its sorted-output requirement.
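
A rough sketch of the CPU-side pattern described above, assuming the common ggml convention of addressing a shared work buffer by (ne00 + CACHE_LINE_SIZE_F32) * ith; the names (wdata, ith, ne00) follow ggml conventions but this is not the PR's actual kernel:

```cpp
// Sketch of per-row top-k on the CPU backend, under the stated assumptions.
#include <algorithm>
#include <cstdint>
#include <utility>

static void top_k_row(const float * src, int32_t * dst, int64_t ne00, int k,
                      int32_t * wdata, int ith, int64_t cache_line_f32) {
    // each thread works in its own padded slice of the shared buffer,
    // with the padding keeping slices on separate cache lines
    int32_t * idx = wdata + (ne00 + cache_line_f32) * ith;
    for (int64_t i = 0; i < ne00; ++i) {
        idx[i] = (int32_t) i;
    }

    // order only the first k positions: O(ne00 log k)
    std::partial_sort(idx, idx + k, idx + ne00,
                      [src](int32_t a, int32_t b) { return src[a] > src[b]; });

    std::copy(idx, idx + k, dst);

    // deliberately break the sorted order so callers never rely on it,
    // mirroring the "unordered output" semantics described above
    if (k > 1) {
        std::swap(dst[0], dst[1]);
    }
}
```
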

@loci-dev loci-dev force-pushed the main branch 22 times, most recently from 17a79cb to a89c6ad on November 27, 2025 12:14
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from e5edfa8 to e8163f9 on December 7, 2025 02:46