fix flaky tests and update headers for sqlite collision prevention #956
Conversation
CI update from checking this against #954/#955: #956 still fails. Observed failures in run https://github.com/osaurus-ai/osaurus/actions/runs/24996830968/job/73197126271:
Two concrete follow-ups that look worth checking:
- The run used Xcode 26.3. That was not the immediate failure here, because the package built and tests launched, but #954/#955 already hit an unrelated EventSource transitive-module build failure on 26.3 before I aligned their CI pins to 26.4.1. After the flake fixes are green, it may still be worth aligning this branch/main CI with the release workflow’s Xcode pin.
PR-event runs of release-drafter@v7 fail with:
Validation Failed: {"resource":"Release","code":"invalid",
"field":"target_commitish"}
Triggered by v7's `Updating existing release...` step which sends
`target_commitish: refs/pull/<N>/merge` — a value GitHub's REST
`/releases` API rejects (must be a real branch). v6 omits the field
on PR runs and only updates the draft body, which is what we want for
the PR-time preview.
Empirical evidence:
PR #944 / #951 / #956 — passed on v6 (the prior pin).
PR #953 — failed on v7 (introduced by commit 56c4dc7 "fix release
drafter" on main 2026-04-27, which mistakenly upgraded the
action while attempting to fix something else).
Revert to v6 restores green CI. The push-to-main run keeps working on
either version because that path provides a real branch ref.
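To make the failure mode above concrete, here is a small sketch of the payload difference between the two action versions, as described in this message. This is illustrative Python, not release-drafter's actual code; the helper names are hypothetical.

```python
# Hypothetical sketch of the PATCH /repos/{owner}/{repo}/releases/{id} body
# built by each action version on a PR-event run. Not release-drafter's code.
def draft_update_payload(body, ref, send_target):
    """send_target=True mimics v7 (always sends target_commitish);
    send_target=False mimics v6 on PR runs (field omitted)."""
    payload = {"body": body, "draft": True}
    if send_target:
        # Rejected by the API when ref is the synthetic PR merge ref.
        payload["target_commitish"] = ref
    return payload

def is_valid_target(ref):
    # The /releases API requires a real branch name, not a merge ref.
    return not ref.startswith("refs/pull/")

v7_pr = draft_update_payload("notes", "refs/pull/956/merge", send_target=True)
v6_pr = draft_update_payload("notes", "refs/pull/956/merge", send_target=False)
```

The v6 payload simply never carries the offending field on PR runs, which is why only the draft body gets updated there.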
…correct (#953)

* fix(preflight): detect mislabeled JANGTQ bundles (sidecar present + non-mxtq stamp)

Bundles that ship jangtq_runtime.safetensors but stamp jang_config.json as weight_format: "bf16" (or omit the field entirely) cause vmlx to dispatch to the BASE DeepseekV4Model / MiniMaxModel / etc. class. The safetensors then carry tq_norms / tq_packed tensors the base parameter loader can't decode, surfacing as:

Error: Unhandled keys ["tq_norms", "tq_packed"] in model.layers.0.mlp.switch_mlp.up_proj in DeepseekV4Model...DeepseekV4MoE.SwitchGLU.SwitchLinear

Live-confirmed 2026-04-25 on an early DSV4-Flash JANGTQ bundle stamped "bf16" — the case the vmlx integration doc explicitly notes via the DSV4_FORCE_JANGTQ=1 env-var workaround.

This commit extends validateJANGTQSidecarIfRequired with the inverse mismatch check: sidecar present + weight_format != "mxtq" → throw a clear remediation error instead of letting vmlx consume 60+ shards before failing on Unhandled keys. Genuine bf16 dense bundles (DSV4-Flash-JANG_2L, etc.) ship NO sidecar and continue to pass through.

Three new tests cover both mislabel shapes (bf16 + sidecar, absent + sidecar) plus a regression test confirming the bf16-no-sidecar happy path still works.

* deps: pin vmlx-swift-lm by revision instead of branch (per tpae's new convention)

Switch from `branch: "main"` to `revision: "c992df9..."` in Packages/OsaurusCore/Package.swift, and strip the now-redundant `branch: "main"` field from osaurus.xcworkspace's Package.resolved `state` block. SwiftPM keeps the revision-only state when the source declaration uses revision pinning; collaborators get the matching lock file by re-resolving the workspace (Xcode 'File ▸ Packages ▸ Reset Package Caches' or `xcodebuild -resolvePackageDependencies`).

Why: prevents unwanted upgrades from the upstream main branch moving, and makes local development against vmlx hashes deterministic across contributors. Branch-based pinning silently picked up new commits on every package refresh.

Build verified: `xcodebuild build` against the workspace succeeds at this pin.

* feat(reasoning): surface vmlx unclosedReasoning to chat UI as a trapped-thinking warning

Wires vmlx-swift-lm c992df9's GenerateCompletionInfo.unclosedReasoning flag through the osaurus runtime + chat UI.

Reasoning-trained Qwen3.6-A3B and DSV4 fine-tunes can get trapped doing meta-cognitive self-validation loops on prompts like 'give me a 20-digit number' — the model emits [Final Check] / Self-Correction tokens forever and never closes </think>, so the visible content channel stays empty while the answer sits buried in the reasoning pane. vmlx now detects this case (insideReasoning at stream end) and surfaces it on Generation.info(GenerateCompletionInfo.unclosedReasoning).

This commit threads that signal end-to-end:

vmlx GenerateCompletionInfo.unclosedReasoning
↓ GenerationEventMapper passthrough (logs it on the [perf] mlxStats line)
↓ ModelRuntimeEvent.completionInfo(..., unclosedReasoning: Bool)
↓ ModelRuntime.streamWithTools encodes via StreamingStatsHint (wire format: "<sentinel>stats:<count>;<tps>[;unclosed]" — third field is comma-separated flags so future signals stay backward-compat)
↓ ChatView decodes + persists onto ChatTurn.unclosedReasoning
↓ ContentBlock.generationStats carries the flag to the cell renderer
↓ NativeStatsView appends a one-line warning chip beside the tok/s counter: '⚠ thinking didn't close — try Disable Thinking'

The HTTP API path intentionally drops the stats sentinel today (existing behavior — clients only see OpenAI/Ollama-shaped JSON), so this is a chat-UI-only surface for now.

Live verified end-to-end on DSV4-Flash JANGTQ at HEAD with the exact repro prompt from the upstream c992df9 commit: prompt='Give me a random 20-digit number. Only return the number itself.' → log line: 'genTokens=256 stop=length unclosedReasoning=true'

Tests:
- 7 new StreamingStatsHint cases (encode/decode roundtrip with the new flag; legacy 2-field payload still decodes; unknown future flags are ignored without breaking parse).
- 1 new GenerationEventMapper case (vmlx-set unclosedReasoning propagates through the bridge).
- Full suite: 1027/1027 pass.

Pin: bumped to vmlx-swift-lm c992df9 (the commit that adds the flag). Pairs with — but does NOT need — QwenThinkingProfile already defaulting disableThinking=true for Qwen3.x; this fallback exists for the case where the user explicitly opts INTO thinking and the model's training nonetheless traps.

* fix(deps): bump vmlx-swift-lm pin to a7db6e5 (closes Metal-assertion crash on cache hit)

Closes the deterministic `notifyExternalReferencesNonZeroOnDealloc` crash class that fired in `BatchEngine.stepPrefill` immediately after `Cache disk hit` on the second request to the same model. Two upstream commits land that fix together:

- 98289d9 forces eager MLX async-eval on `slot.cache` after disk restore so lazy-restore graph ops don't extend into the next request's command buffer.
- a7db6e5 sets `continuation.onTermination` on `BatchEngine.generate`'s outStream so orphan slots get reaped via `engine.cancel(requestId)` whenever a consumer breaks early.

Net effect: the `Task.isCancelled` break in MLXBatchAdapter lines 209-222 is safe, no `enableDiskCache: false` workaround is needed, and the cache disk-hit fast path is now usable on back-to-back requests.

Live verification (this commit, exhaustive 5-phase matrix):
- Phase 1 (exact bug report repro): qwen3.6-27b-mxfp4-crack, 2x identical prompts, request B finished cleanly in 6s (vs 10s cold A).
- Phase 2 (sustained loop): 5x back-to-back identical, app pid steady, RSS stable.
- Phase 3 (orphan-slot pattern): 3x cycles of cancel-mid-stream + immediate retry, no crash on any cycle.
- Phase 4 (hybrid SSM): holo3-35b-a3b-jangtq4, prompt-tps jumped 75→482 (6.3x) on cache hit, no crash.
- Phase 5 (other model class): deepseekv4-flash-jangtq A+B identical, both completed cleanly.

Across 15+ generations spanning 3 model_type families (qwen3_5_moe / qwen3_5_moe+SSM / deepseek_v4) and ~3.5 minutes of app uptime: ZERO Metal assertions, zero process exits, zero `destroyed while still required` lines in stderr or syslog.

Also includes c992df9's `GenerateCompletionInfo.unclosedReasoning` for the trapped-thinking chip — same pin lineage.

* ci: revert release-drafter v7 -> v6 (target_commitish API rejection)
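The preflight rule in the fix(preflight) commit reduces to a single consistency check between the sidecar file and the config stamp. The real check lives in Swift (validateJANGTQSidecarIfRequired); the following Python sketch only illustrates the rule, and the function name and message wording here are hypothetical.

```python
# Illustrative sketch of the JANGTQ sidecar/stamp consistency rule.
# Names are hypothetical; the real implementation is Swift.
def check_jangtq_consistency(has_sidecar, weight_format):
    """Fail fast on mislabeled bundles instead of letting the loader
    consume 60+ shards before hitting 'Unhandled keys'."""
    if has_sidecar and weight_format != "mxtq":
        stamped = weight_format if weight_format is not None else "<missing>"
        raise ValueError(
            "jangtq_runtime.safetensors present but jang_config.json stamps "
            f"weight_format={stamped!r}; expected 'mxtq'. Re-stamp the bundle "
            "or remove the sidecar."
        )
    # No sidecar: genuine bf16 dense bundles pass through untouched.

# The three shapes the new tests cover:
check_jangtq_consistency(False, "bf16")   # happy path: dense bundle, no sidecar
check_jangtq_consistency(True, "mxtq")    # happy path: correctly stamped JANGTQ
# check_jangtq_consistency(True, "bf16")  # raises: mislabeled stamp
# check_jangtq_consistency(True, None)    # raises: stamp omitted entirely
```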
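The stats-hint wire format quoted in the feat(reasoning) commit ("<sentinel>stats:<count>;<tps>[;unclosed]", third field comma-separated flags) can be sketched as a small encode/decode pair. The real implementation is Swift's StreamingStatsHint; this Python version is only a model of the format, and SENTINEL is a made-up placeholder because the actual sentinel bytes are not shown in the commit message.

```python
# Model of the StreamingStatsHint wire format. SENTINEL is a hypothetical
# stand-in; the production sentinel is defined in the Swift code.
SENTINEL = "\u0000osaurus\u0000"

def encode_hint(count, tps, flags=None):
    fields = [str(count), f"{tps:.2f}"]
    if flags:
        fields.append(",".join(flags))  # optional third field: flag list
    return f"{SENTINEL}stats:" + ";".join(fields)

def decode_hint(payload):
    prefix = f"{SENTINEL}stats:"
    if not payload.startswith(prefix):
        return None
    fields = payload[len(prefix):].split(";")
    if len(fields) < 2:
        return None  # malformed hint
    flags = fields[2].split(",") if len(fields) >= 3 else []
    return {
        "count": int(fields[0]),
        "tps": float(fields[1]),
        # Unknown future flags are simply not interpreted, so old decoders
        # stay compatible with new encoders and vice versa.
        "unclosed_reasoning": "unclosed" in flags,
    }
```

This shape matches the compatibility properties the tests assert: a legacy 2-field payload still decodes (flags default to empty), and an unrecognized flag is ignored without breaking the parse.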
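The orphan-slot fix in a7db6e5 is an instance of a general pattern: tie resource cleanup to stream termination so a consumer that breaks early still releases its slot. Upstream this is done with `continuation.onTermination` on a Swift AsyncStream; the Python sketch below uses a generator's `finally` block to play the same role. All names here are illustrative, not the vmlx API.

```python
# Language-neutral sketch of cleanup-on-termination for a token stream.
# Illustrative names only; the real mechanism is Swift's
# AsyncStream continuation.onTermination calling engine.cancel(requestId).
class Engine:
    def __init__(self):
        self.active_slots = set()

    def cancel(self, request_id):
        self.active_slots.discard(request_id)  # reap the orphan slot

    def generate(self, request_id, tokens):
        self.active_slots.add(request_id)
        try:
            yield from tokens
        finally:
            # Runs whether the stream is exhausted, closed, or abandoned,
            # mirroring onTermination on the Swift side.
            self.cancel(request_id)

engine = Engine()
gen = engine.generate("req-B", ["a", "b", "c"])
first = next(gen)  # consumer takes one token...
gen.close()        # ...then abandons the stream early (cancel analogue)
```

Without the `finally` (onTermination) hook, the early abandonment above would leave "req-B" occupying a slot, which is exactly the state that made the second cache-hit request crash.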
Summary
fixed flaky tests
Checklist
- CONTRIBUTING.md