[codex] Preserve system prompts for no-system templates #1
Draft
mimeding wants to merge 1 commit into
Author

Maintainer/review note for this PR: this PR is the upstream replacement for the Osaurus-side shim in osaurus-ai/osaurus#992. @tpae correctly pointed out that the fix belongs upstream.

What I need from maintainers here:

Local validation is green.

Once this lands, the Osaurus follow-up should be only a dependency bump plus a small regression test, and osaurus-ai/osaurus#992 can be closed or replaced.
viktike pushed a commit to viktike/vmlx-swift-lm that referenced this pull request on May 2, 2026:
Previously `NemotronHOmni.prepare` ran the entire prompt unchunked
through the model for text-only inputs. For prompts > ~8k tokens this
triggered an O(L²) memory explosion in `ssmAttn` (SSM.swift):
segsum: x = MLX.repeated(x[.ellipsis, .newAxis], count: l, axis: -1)
// → [B, n_heads, L, L] = 17 GB at L=16k bf16 PER MAMBA LAYER
With 23 sequential Mamba layers and lazy-eval intermediates accumulating
across the forward pass, peak Metal-buffer allocation at L=16k hit
418 GiB on M5 Max. Reproduced under StabilityBench S5 with the new
mx::malloc large-allocation tracer (osaurus-ai/mlx@96aa27a5):
metal::malloc requested 298.32 GiB (320,320,080,000 bytes)
#0 mlx::core::ternary_op_gpu ← segsum's MLX.which mask op
#1 mlx::core::gpu::eval
...
#6 BatchEngine.stepPrefill (slot.cache materialization)
metal::malloc requested 37.29 GiB (40,040,010,000 bytes)
#0 mlx::core::gpu::eval ← segsum [B, h, L, L] base
99 tracer hits captured during a single S5 (16k token) run.
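As a sanity check on the quoted numbers, a small back-of-the-envelope calculation reproduces the per-layer figure. B=1 and n_heads=32 are illustrative assumptions, not values stated in this log; only L=16k and the ~17 GB per-layer figure come from the commit message.

```python
# Rough size of the segsum intermediate [B, n_heads, L, L] in bf16
# (2 bytes per element).
def segsum_bytes(batch: int, n_heads: int, seq_len: int, elem_bytes: int = 2) -> int:
    return batch * n_heads * seq_len * seq_len * elem_bytes

per_layer = segsum_bytes(1, 32, 16_384)
print(f"{per_layer / 1e9:.1f} GB per Mamba layer")  # → 17.2 GB per Mamba layer
print(f"{23 * per_layer / 1e9:.0f} GB across 23 layers if intermediates accumulate")
print(f"{segsum_bytes(1, 32, 512) / 1e6:.1f} MB per layer with 512-token chunks")
```

The last line shows why chunking helps: shrinking L from 16k to 512 shrinks the L² term by a factor of ~1000.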
Fix: chunked prefill mirroring `LLMModel.prepare`. Mamba layers carry
running state across chunks via `MambaCache` (designed for this);
attention layers update KV in place. Each `prefillStepSize` chunk
(default 512) materializes lazily-built intermediates and runs
Memory.clearCache before the next chunk, bounding peak allocation
to O(chunk_size²) per layer instead of O(prompt_length²).
The prepare step now always returns `.logits`, so BatchEngine never
re-axises this output, and the `.newAxis` trap that the original
`.logits` path was avoiding stays avoided.
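The chunked-prefill loop described above can be sketched language-neutrally in Python. The real implementation is Swift (`MambaCache`, `prefillStepSize`, `Memory.clearCache`); the names `State` and `model_step` below are hypothetical stand-ins for illustration only.

```python
from typing import List

CHUNK = 512  # mirrors the default prefillStepSize

class State:
    """Hypothetical running state: Mamba recurrent state plus an
    in-place KV cache, carried across chunks."""
    def __init__(self) -> None:
        self.tokens_seen = 0

def model_step(chunk: List[int], state: State) -> None:
    # Stand-in for the real forward pass over one chunk. Because each
    # chunk is at most CHUNK tokens long, the segsum-style intermediates
    # are O(CHUNK**2) per layer instead of O(len(prompt)**2).
    state.tokens_seen += len(chunk)

def chunked_prefill(prompt: List[int], state: State) -> None:
    for start in range(0, len(prompt), CHUNK):
        model_step(prompt[start:start + CHUNK], state)
        # The real code materializes the lazily built intermediates and
        # calls Memory.clearCache() here before the next chunk.

state = State()
chunked_prefill(list(range(16_384)), state)
print(state.tokens_seen)  # → 16384
```

The key design point is that no per-chunk result needs to be stitched together afterwards: the Mamba state and KV cache absorb each chunk, so only the final chunk's logits matter.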
Verification on real Nemotron-3-Nano-Omni-30B-A3B-MXFP4 weights:
Row | Pre-fix peak | Pre-fix time | Post-fix peak | Post-fix time
--------+--------------+--------------+---------------+--------------
S5 16k | 418.2 GB | 85.81 s | < 512 MiB | 4.01 s
S11 60k | OOM | crash | < 512 MiB | 15.39 s
(Pre-fix S5 ran on 128 GB unified memory, with macOS swap absorbing the
over-cap allocation; smaller machines crashed. Pre-fix S11 60k is the
154 GB malloc Bug 2 repro from the osaurus PR #967 description.)
11/11 StabilityBench rows pass with the fix. CacheCoordinator + Sample
+ MC/DC suites: 72/72 green, no regressions.
Companion: Package.swift bumps the mlx-swift pin e0b6111 → 0a56f9 to
pull in the osaurus-ai/mlx@96aa27a5 mx::malloc tracer (env-gated, zero
overhead when off). This tracer was used to find the bug.
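The "env-gated, zero overhead when off" pattern could look roughly like this. This is a Python sketch of the idea only; the real tracer lives in C++ inside mlx, and the environment-variable name below is hypothetical.

```python
import os

# Gate read once at startup. When the variable is unset the threshold is
# zero, and the hot allocation path skips tracing with a single cheap check.
TRACE_THRESHOLD = int(os.environ.get("MLX_MALLOC_TRACE_BYTES", "0"))

def traced_malloc(nbytes: int) -> bytearray:
    if TRACE_THRESHOLD and nbytes >= TRACE_THRESHOLD:
        # The real tracer also captures a backtrace at this point.
        print(f"malloc requested {nbytes / 2**30:.2f} GiB ({nbytes:,} bytes)")
    return bytearray(nbytes)

buf = traced_malloc(1024)  # silent unless the env var is set
```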
Summary

This moves the Gemma/system-prompt fix to the upstream dependency instead of keeping an Osaurus-side shim for osaurus-ai/osaurus#992.

- Preserves `system` roles by folding them into the first user message, or inserting a synthetic user turn when needed (`NoSystemMessageGenerator`).
- Handles the case where the `RunBench/` directory is absent.
- Uses the `vmlx-swift-lm-Package` scheme and the matching DerivedData test bundle path.
- Runs `EvalTests` separately to avoid Metal command-encoder contention during the full Xcode test bundle.

Validation
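The fold-or-insert behavior from the summary can be sketched as follows. This is a Python approximation of the idea; the function name and message shape are illustrative, not the actual `NoSystemMessageGenerator` API.

```python
def fold_system_messages(messages: list[dict]) -> list[dict]:
    """For templates that reject a `system` role: fold system turns into
    the first user message, or insert a synthetic user turn when no user
    message exists to absorb them."""
    system_text = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_text:
        return rest
    for m in rest:
        if m["role"] == "user":
            m["content"] = f"{system_text}\n\n{m['content']}"
            return rest
    # No user turn to fold into: prepend a synthetic one.
    return [{"role": "user", "content": system_text}] + rest

msgs = [{"role": "system", "content": "Be terse."},
        {"role": "user", "content": "Hi"}]
print(fold_system_messages(msgs))
# → [{'role': 'user', 'content': 'Be terse.\n\nHi'}]
```

Either way the system text is preserved verbatim rather than silently dropped, which is the whole point of the fix.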
- git diff --check
- swift package describe --type json
- swift test --filter ChatMessageToolCallTests
- xcodebuild build-for-testing -scheme vmlx-swift-lm-Package -destination 'platform=macOS'
- xcrun xctest ~/Library/Developer/Xcode/DerivedData/vmlx-swift-lm-exec-*/Build/Products/Debug/MLXLMTests.xctest

Note: direct SwiftPM execution of the Metal-heavy EvalTests still hits the local `default.metallib` loading limitation, so the full validation path uses the Xcode-built test bundle, matching the macOS workflow.