
[codex] Preserve system prompts for no-system templates #1

Draft
mimeding wants to merge 1 commit into
osaurus-ai:main from
mimeding:codex/gemma-system-prompt-template

Conversation


@mimeding mimeding commented May 1, 2026

Summary

This moves the Gemma/system-prompt fix to the upstream dependency instead of keeping an Osaurus-side shim for osaurus-ai/osaurus#992.

  • Preserve system instructions for templates that cannot accept system roles by folding them into the first user message, or inserting a synthetic user turn when needed.
  • Keep tool calls intact through NoSystemMessageGenerator.
  • Make a clean public checkout resolvable when the ignored local RunBench/ directory is absent.
  • Repair the Osaurus macOS CI path by using the vmlx-swift-lm-Package scheme and the matching DerivedData test bundle path.
  • Serialize the MLX sampling portions of EvalTests to avoid Metal command-encoder contention during the full Xcode test bundle.
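
As a rough illustration of the folding rule described above, here is a minimal sketch. The `Message` type and `foldSystemMessages` helper are hypothetical stand-ins; the actual NoSystemMessageGenerator API in this PR may be structured differently.

```swift
// Hypothetical sketch: fold system instructions into the first user turn
// for chat templates that reject the "system" role. All non-system roles
// (including tool turns) pass through untouched, which is how tool calls
// stay intact.
struct Message {
    var role: String   // "system", "user", "assistant", "tool"
    var content: String
}

func foldSystemMessages(_ messages: [Message]) -> [Message] {
    let systemText = messages
        .filter { $0.role == "system" }
        .map(\.content)
        .joined(separator: "\n")
    var rest = messages.filter { $0.role != "system" }
    guard !systemText.isEmpty else { return rest }

    if let i = rest.firstIndex(where: { $0.role == "user" }) {
        // Prepend the system instructions to the first user message.
        rest[i].content = systemText + "\n\n" + rest[i].content
    } else {
        // No user turn exists (e.g. tool-only history): insert a
        // synthetic user turn carrying the instructions.
        rest.insert(Message(role: "user", content: systemText), at: 0)
    }
    return rest
}
```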

Validation

  • git diff --check
  • swift package describe --type json
  • swift test --filter ChatMessageToolCallTests
  • xcodebuild build-for-testing -scheme vmlx-swift-lm-Package -destination 'platform=macOS'
  • xcrun xctest ~/Library/Developer/Xcode/DerivedData/vmlx-swift-lm-exec-*/Build/Products/Debug/MLXLMTests.xctest
    • 512 tests passed locally; Gemma4 local snapshot tests skipped because the optional model snapshots are not installed.

Note: direct SwiftPM execution of the Metal-heavy EvalTests still hits the local default.metallib loading limitation, so the full validation path uses the Xcode-built test bundle, matching the macOS workflow.

Author

mimeding commented May 1, 2026

Maintainer/review note for this PR:

This PR is the upstream replacement for the Osaurus-side shim in osaurus-ai/osaurus#992. @tpae correctly pointed out that the fix belongs in vmlx-swift-lm, not in Osaurus as a Gemma-specific compatibility case.

What I need from maintainers here:

  • Review the NoSystemMessageGenerator behavior: system instructions are preserved by folding them into the first user turn for templates that cannot accept a system role, while tool calls remain intact.
  • Confirm whether this is the preferred upstream behavior for no-system templates, or whether you want the folding format adjusted.
  • Check/enable CI for this repo. The workflow is active, but GitHub has not attached checks to this PR yet. This branch includes a workflow fix for the stale mlx-swift-lm-Package scheme and DerivedData path, plus package-resolution fixes needed in a fresh public checkout.

Local validation is green:

  • git diff --check
  • swift package describe --type json
  • swift test --filter ChatMessageToolCallTests
  • xcodebuild build-for-testing -scheme vmlx-swift-lm-Package -destination 'platform=macOS'
  • full Xcode-built MLXLMTests.xctest: 512 tests passed locally

Once this lands, the Osaurus follow-up should be only a dependency bump plus a small regression test, and osaurus-ai/osaurus#992 can be closed or replaced.

viktike pushed a commit to viktike/vmlx-swift-lm that referenced this pull request May 2, 2026
Previously, NemotronHOmni.prepare ran the entire prompt through the
model unchunked when the input was text-only. For prompts > ~8k tokens
this triggered an O(L²) memory explosion in `ssmAttn` (SSM.swift):

    segsum: x = MLX.repeated(x[.ellipsis, .newAxis], count: l, axis: -1)
            // → [B, n_heads, L, L] = 17 GB at L=16k bf16 PER MAMBA LAYER

With 23 sequential Mamba layers and lazy-eval intermediates accumulating
across the forward pass, peak Metal-buffer allocation at L=16k hit
418 GiB on M5 Max. Reproduced under StabilityBench S5 with the new
mx::malloc large-allocation tracer (osaurus-ai/mlx@96aa27a5):

    metal::malloc requested 298.32 GiB (320,320,080,000 bytes)
        #0 mlx::core::ternary_op_gpu  ← segsum's MLX.which mask op
        osaurus-ai#1 mlx::core::gpu::eval
        ...
        #6 BatchEngine.stepPrefill (slot.cache materialization)

    metal::malloc requested 37.29 GiB (40,040,010,000 bytes)
        #0 mlx::core::gpu::eval  ← segsum [B, h, L, L] base

99 tracer hits captured during a single S5 (16k token) run.
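
The per-layer figure above checks out arithmetically. A back-of-the-envelope sketch (n_heads = 32 is an assumption here, chosen only because it reproduces the "17 GB at L=16k bf16 per Mamba layer" number; the real model config may differ):

```swift
// Size of the segsum intermediate [B, n_heads, L, L] in bf16.
// n_heads = 32 is an assumed value that reproduces the quoted
// "17 GB at L=16k per Mamba layer" figure.
let B = 1
let nHeads = 32
let l = 16_384
let bytesPerElem = 2  // bfloat16

let fullBytes = B * nHeads * l * l * bytesPerElem
print(Double(fullBytes) / 1e9)    // ≈ 17.18 GB per Mamba layer at L = 16k

// With chunked prefill, each chunk only materializes O(chunk²) per layer:
let chunk = 512
let chunkedBytes = B * nHeads * chunk * chunk * bytesPerElem
print(Double(chunkedBytes) / 1e6) // ≈ 16.8 MB per layer per chunk
```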

Fix: chunked prefill mirroring `LLMModel.prepare`. Mamba layers carry
running state across chunks via `MambaCache` (designed for this);
attention layers update KV in place. Each `prefillStepSize` chunk
(default 512) materializes lazily-built intermediates and runs
Memory.clearCache before the next chunk, bounding peak allocation
to O(chunk_size²) per layer instead of O(prompt_length²).

`prepare` now always returns `.logits`, so BatchEngine never re-axises
this output, and the `.newAxis` trap that the original `.logits` path
was avoiding remains avoided.
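
The chunked-prefill loop described above follows a standard pattern; a simplified sketch (the model, cache, and eval calls named in the comments are hypothetical placeholders for the vmlx-swift-lm internals):

```swift
// Split a prompt into prefillStepSize-sized chunks. In the real
// prepare(), each chunk range would be run through the model — Mamba
// layers carrying running state across chunks via MambaCache, attention
// layers updating KV in place — followed by evaluating the chunk's
// lazily-built intermediates and calling Memory.clearCache() before the
// next chunk. Those model calls are omitted here as hypothetical.
func prefillChunks(promptLength: Int, stepSize: Int = 512) -> [Range<Int>] {
    var chunks: [Range<Int>] = []
    var start = 0
    while start < promptLength {
        let end = min(start + stepSize, promptLength)
        chunks.append(start..<end)
        start = end
    }
    return chunks
}

// A 16k-token prompt becomes 32 chunks of 512, so per-layer peak
// allocation is bounded by O(512²) instead of O(16_384²).
let chunks = prefillChunks(promptLength: 16_384)
print(chunks.count)  // 32
```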

Verification on real Nemotron-3-Nano-Omni-30B-A3B-MXFP4 weights:

   Row     | Pre-fix peak | Pre-fix time | Post-fix peak | Post-fix time
   --------+--------------+--------------+---------------+--------------
   S5 16k  | 418.2 GB     | 85.81 s      | < 512 MiB     | 4.01 s
   S11 60k | OOM          | crash        | < 512 MiB     | 15.39 s

(Pre-fix S5 only completed because it ran on a 128 GB unified-memory
machine where macOS swap absorbed the over-cap allocation; smaller
machines crashed outright. Pre-fix S11 60k is the 154 GB malloc Bug 2
repro from the osaurus PR #967 description.)

11/11 StabilityBench rows pass with the fix. CacheCoordinator + Sample
+ MC/DC suites: 72/72 green, no regressions.

Companion: Package.swift bumps mlx-swift pin e0b6111 → 0a56f9 to pull
in the osaurus-ai/mlx@96aa27a5 mx::malloc tracer (env-gated, zero
overhead when off). Tracer used to find this bug.
