
[codex] Preserve system prompts for no-system templates #1

Draft
mimeding wants to merge 1 commit into
osaurus-ai:main from
mimeding:codex/gemma-system-prompt-template

Conversation


@mimeding mimeding commented May 1, 2026

Summary

This moves the Gemma/system-prompt fix to the upstream dependency instead of keeping an Osaurus-side shim for osaurus-ai/osaurus#992.

  • Preserve system instructions for templates that cannot accept system roles by folding them into the first user message, or inserting a synthetic user turn when needed.
  • Keep tool calls intact through NoSystemMessageGenerator.
  • Make a clean public checkout resolvable when the ignored local RunBench/ directory is absent.
  • Repair the Osaurus macOS CI path by using the vmlx-swift-lm-Package scheme and the matching DerivedData test bundle path.
  • Serialize the MLX sampling portions of EvalTests to avoid Metal command-encoder contention during the full Xcode test bundle.
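
As a rough illustration of the folding rule described above, here is a minimal sketch. The `Message` type and `foldSystemMessages` helper are hypothetical stand-ins; the actual NoSystemMessageGenerator API in this PR may be structured differently.

```swift
// Hypothetical sketch: fold system instructions into the first user turn
// for chat templates that reject the "system" role. All non-system roles
// (including tool turns) pass through untouched, which is how tool calls
// stay intact.
struct Message {
    var role: String   // "system", "user", "assistant", "tool"
    var content: String
}

func foldSystemMessages(_ messages: [Message]) -> [Message] {
    let systemText = messages
        .filter { $0.role == "system" }
        .map(\.content)
        .joined(separator: "\n")
    var rest = messages.filter { $0.role != "system" }
    guard !systemText.isEmpty else { return rest }

    if let i = rest.firstIndex(where: { $0.role == "user" }) {
        // Prepend the system instructions to the first user message.
        rest[i].content = systemText + "\n\n" + rest[i].content
    } else {
        // No user turn exists (e.g. tool-only history): insert a
        // synthetic user turn carrying the instructions.
        rest.insert(Message(role: "user", content: systemText), at: 0)
    }
    return rest
}
```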

Validation

  • git diff --check
  • swift package describe --type json
  • swift test --filter ChatMessageToolCallTests
  • xcodebuild build-for-testing -scheme vmlx-swift-lm-Package -destination 'platform=macOS'
  • xcrun xctest ~/Library/Developer/Xcode/DerivedData/vmlx-swift-lm-exec-*/Build/Products/Debug/MLXLMTests.xctest
    • 512 tests passed locally; Gemma4 local snapshot tests skipped because the optional model snapshots are not installed.

Note: direct SwiftPM execution of the Metal-heavy EvalTests still hits the local default.metallib loading limitation, so the full validation path uses the Xcode-built test bundle, matching the macOS workflow.

Author

mimeding commented May 1, 2026

Maintainer/review note for this PR:

This PR is the upstream replacement for the Osaurus-side shim in osaurus-ai/osaurus#992. @tpae correctly pointed out that the fix belongs in vmlx-swift-lm, not in Osaurus as a Gemma-specific compatibility case.

What I need from maintainers here:

  • Review the NoSystemMessageGenerator behavior: system instructions are preserved by folding them into the first user turn for templates that cannot accept a system role, while tool calls remain intact.
  • Confirm whether this is the preferred upstream behavior for no-system templates, or whether you want the folding format adjusted.
  • Check/enable CI for this repo. The workflow is active, but GitHub has not attached checks to this PR yet. This branch includes a workflow fix for the stale mlx-swift-lm-Package scheme and DerivedData path, plus package-resolution fixes needed in a fresh public checkout.

Local validation is green:

  • git diff --check
  • swift package describe --type json
  • swift test --filter ChatMessageToolCallTests
  • xcodebuild build-for-testing -scheme vmlx-swift-lm-Package -destination 'platform=macOS'
  • full Xcode-built MLXLMTests.xctest: 512 tests passed locally

Once this lands, the Osaurus follow-up should be only a dependency bump plus a small regression test, and osaurus-ai/osaurus#992 can be closed or replaced.

viktike pushed a commit to viktike/vmlx-swift-lm that referenced this pull request May 2, 2026
Previously, NemotronHOmni.prepare ran the entire prompt through the
model unchunked when the input was text-only. For prompts > ~8k tokens
this triggered an O(L²) memory explosion in `ssmAttn` (SSM.swift):

    segsum: x = MLX.repeated(x[.ellipsis, .newAxis], count: l, axis: -1)
            // → [B, n_heads, L, L] = 17 GB at L=16k bf16 PER MAMBA LAYER

With 23 sequential Mamba layers and lazy-eval intermediates accumulating
across the forward pass, peak Metal-buffer allocation at L=16k hit
418 GiB on M5 Max. Reproduced under StabilityBench S5 with the new
mx::malloc large-allocation tracer (osaurus-ai/mlx@96aa27a5):

    metal::malloc requested 298.32 GiB (320,320,080,000 bytes)
        #0 mlx::core::ternary_op_gpu  ← segsum's MLX.which mask op
        osaurus-ai#1 mlx::core::gpu::eval
        ...
        #6 BatchEngine.stepPrefill (slot.cache materialization)

    metal::malloc requested 37.29 GiB (40,040,010,000 bytes)
        #0 mlx::core::gpu::eval  ← segsum [B, h, L, L] base

99 tracer hits captured during a single S5 (16k token) run.
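
The per-layer figure above checks out arithmetically. A back-of-the-envelope sketch (n_heads = 32 is an assumption here, chosen only because it reproduces the "17 GB at L=16k bf16 per Mamba layer" number; the real model config may differ):

```swift
// Size of the segsum intermediate [B, n_heads, L, L] in bf16.
// n_heads = 32 is an assumed value that reproduces the quoted
// "17 GB at L=16k per Mamba layer" figure.
let B = 1
let nHeads = 32
let l = 16_384
let bytesPerElem = 2  // bfloat16

let fullBytes = B * nHeads * l * l * bytesPerElem
print(Double(fullBytes) / 1e9)    // ≈ 17.18 GB per Mamba layer at L = 16k

// With chunked prefill, each chunk only materializes O(chunk²) per layer:
let chunk = 512
let chunkedBytes = B * nHeads * chunk * chunk * bytesPerElem
print(Double(chunkedBytes) / 1e6) // ≈ 16.8 MB per layer per chunk
```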

Fix: chunked prefill mirroring `LLMModel.prepare`. Mamba layers carry
running state across chunks via `MambaCache` (designed for this);
attention layers update KV in place. Each `prefillStepSize` chunk
(default 512) materializes lazily-built intermediates and runs
Memory.clearCache before the next chunk, bounding peak allocation
to O(chunk_size²) per layer instead of O(prompt_length²).

`prepare` now always returns `.logits`, so BatchEngine never re-axises
this output, and the `.newAxis` trap that the original `.logits` path
was avoiding remains avoided.
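
The chunked-prefill loop described above follows a standard pattern; a simplified sketch (the model, cache, and eval calls named in the comments are hypothetical placeholders for the vmlx-swift-lm internals):

```swift
// Split a prompt into prefillStepSize-sized chunks. In the real
// prepare(), each chunk range would be run through the model — Mamba
// layers carrying running state across chunks via MambaCache, attention
// layers updating KV in place — followed by evaluating the chunk's
// lazily-built intermediates and calling Memory.clearCache() before the
// next chunk. Those model calls are omitted here as hypothetical.
func prefillChunks(promptLength: Int, stepSize: Int = 512) -> [Range<Int>] {
    var chunks: [Range<Int>] = []
    var start = 0
    while start < promptLength {
        let end = min(start + stepSize, promptLength)
        chunks.append(start..<end)
        start = end
    }
    return chunks
}

// A 16k-token prompt becomes 32 chunks of 512, so per-layer peak
// allocation is bounded by O(512²) instead of O(16_384²).
let chunks = prefillChunks(promptLength: 16_384)
print(chunks.count)  // 32
```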

Verification on real Nemotron-3-Nano-Omni-30B-A3B-MXFP4 weights:

   Row     | Pre-fix peak | Pre-fix time | Post-fix peak | Post-fix time
   --------+--------------+--------------+---------------+--------------
   S5 16k  | 418.2 GB     | 85.81 s      | < 512 MiB     | 4.01 s
   S11 60k | OOM          | crash        | < 512 MiB     | 15.39 s

(Pre-fix S5 only completed because it ran on a 128 GB unified-memory
machine where macOS swap absorbed the over-cap allocation; smaller
machines crashed outright. Pre-fix S11 60k is the 154 GB malloc Bug 2
repro from the osaurus PR #967 description.)

11/11 StabilityBench rows pass with the fix. CacheCoordinator + Sample
+ MC/DC suites: 72/72 green, no regressions.

Companion: Package.swift bumps mlx-swift pin e0b6111 → 0a56f9 to pull
in the osaurus-ai/mlx@96aa27a5 mx::malloc tracer (env-gated, zero
overhead when off). Tracer used to find this bug.
