
fix flaky tests and update headers for SQLite collision prevention #956

Merged
tpae merged 5 commits into main from chore/fix-flaky-tests on Apr 27, 2026
Conversation

@tpae (Contributor) commented Apr 27, 2026

Summary

Fixed flaky tests.

Checklist

  • I have read CONTRIBUTING.md
  • I added/updated tests where reasonable
  • I updated docs/README as needed
  • I verified build on macOS with Xcode 16.4+

@mimeding (Contributor)
CI update from checking this against #954/#955: #956 still fails test-core.

Observed failures in run https://github.com/osaurus-ai/osaurus/actions/runs/24996830968/job/73197126271:

  • SandboxManagerCleanupTests.cleanupAfterFailure_removesStaleContainerDirectory
  • PluginCreatorInjectionTests.composeChatContext_injectsPluginCreatorWhenCatalogEmpty

The direct pluginCreatorSkillSection_returnsContentWhenSkillEnabled case passed in this run, so the remaining plugin failure looks specific to the composeChatContext gate around ToolRegistry.shared.dynamicCatalogIsEmpty() and process-wide registry state, rather than to the built-in skill being unavailable.

Two concrete follow-ups that look worth checking:

  • The sandbox cleanup tests now use SandboxTestLock, but cleanupAfterFailure_removesStaleContainerDirectory still uses OsaurusPaths.container() through staleContainerDir. It may also need StoragePathsTestLock or a temp OsaurusPaths.overrideRoot so concurrent storage/path tests cannot move the cleanup target while the test is suspended.
  • The dynamic-catalog lock should cover every test that registers/enables/unregisters dynamic tools or redirects ToolConfigurationStore.overrideDirectory. In the current patch, the MCP tests set ToolConfigurationStore.overrideDirectory = tempDir inside DynamicCatalogTestLock, but I don’t see a restore of the previous override or temp cleanup; that global leak can affect later prompt-composition tests even when the tool itself is unregistered.
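A minimal sketch of the restore-and-cleanup shape the second follow-up suggests. The ToolConfigurationStore here is a simplified stand-in, and withTemporaryToolConfigDirectory is a hypothetical helper; only the overrideDirectory global and the temp-dir pattern come from the discussion above.

```swift
import Foundation

// Simplified stand-in for the real store's global override.
enum ToolConfigurationStore {
    static var overrideDirectory: URL?
}

// Hypothetical helper: saves the previous override, points the store at a
// fresh temp directory for the duration of `body`, then restores the prior
// value and removes the temp directory — so the global never leaks into
// later prompt-composition tests.
func withTemporaryToolConfigDirectory<T>(_ body: (URL) throws -> T) rethrows -> T {
    let previous = ToolConfigurationStore.overrideDirectory
    let tempDir = FileManager.default.temporaryDirectory
        .appendingPathComponent(UUID().uuidString)
    try? FileManager.default.createDirectory(at: tempDir,
                                             withIntermediateDirectories: true)
    defer {
        ToolConfigurationStore.overrideDirectory = previous  // restore, not just nil
        try? FileManager.default.removeItem(at: tempDir)     // temp cleanup
    }
    ToolConfigurationStore.overrideDirectory = tempDir
    return try body(tempDir)
}
```

In the real suite this would run inside DynamicCatalogTestLock, so the save/restore pair is serialized with every other test that touches the override.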

The run used Xcode 26.3. That was not the immediate failure here because the package built and tests launched, but #954/#955 already hit an unrelated EventSource transitive-module build failure on 26.3 before I aligned their CI pins to 26.4.1. After the flake fixes are green, it may still be worth aligning this branch/main CI with the release workflow’s Xcode pin.

tpae merged commit 3084c50 into main on Apr 27, 2026
8 of 9 checks passed
tpae deleted the chore/fix-flaky-tests branch on April 27, 2026 at 16:54
jjang-ai added a commit that referenced this pull request Apr 28, 2026
PR-event runs of release-drafter@v7 fail with:
  Validation Failed: {"resource":"Release","code":"invalid",
                     "field":"target_commitish"}

Triggered by v7's `Updating existing release...` step which sends
`target_commitish: refs/pull/<N>/merge` — a value GitHub's REST
`/releases` API rejects (must be a real branch). v6 omits the field
on PR runs and only updates the draft body, which is what we want for
the PR-time preview.

Empirical evidence:
  PR #944 / #951 / #956 — passed on v6 (the prior pin).
  PR #953 — failed on v7 (introduced by commit 56c4dc7 "fix release
            drafter" on main 2026-04-27, which mistakenly upgraded the
            action while attempting to fix something else).

Revert to v6 restores green CI. The push-to-main run keeps working on
either version because that path provides a real branch ref.
jjang-ai added a commit that referenced this pull request Apr 28, 2026
…correct (#953)

* fix(preflight): detect mislabeled JANGTQ bundles (sidecar present + non-mxtq stamp)

Bundles that ship jangtq_runtime.safetensors but stamp jang_config.json
as weight_format: "bf16" (or omit the field entirely) cause vmlx to
dispatch to the BASE DeepseekV4Model / MiniMaxModel / etc. class. The
safetensors then carry tq_norms / tq_packed tensors the base parameter
loader can't decode, surfacing as:

  Error: Unhandled keys ["tq_norms", "tq_packed"] in
  model.layers.0.mlp.switch_mlp.up_proj in
  DeepseekV4Model...DeepseekV4MoE.SwitchGLU.SwitchLinear

Live-confirmed 2026-04-25 on an early DSV4-Flash JANGTQ bundle stamped
"bf16" — the case the vmlx integration doc explicitly notes via the
DSV4_FORCE_JANGTQ=1 env-var workaround.

This commit extends validateJANGTQSidecarIfRequired with the inverse
mismatch check: sidecar present + weight_format != "mxtq" → throw a
clear remediation error instead of letting vmlx consume 60+ shards
before failing on Unhandled keys.

Genuine bf16 dense bundles (DSV4-Flash-JANG_2L, etc.) ship NO sidecar
and continue to pass through. Three new tests cover both mislabel
shapes (bf16+sidecar, absent+sidecar) plus a regression test confirming
the bf16-no-sidecar happy path still works.
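The mismatch check described above can be sketched as follows. isMislabeledJANGTQ and PreflightError are hypothetical names; only validateJANGTQSidecarIfRequired, the file names, and the "mxtq" stamp come from the commit message.

```swift
import Foundation

struct PreflightError: Error { let message: String }

// Pure predicate for the inverse mismatch: sidecar present but the config
// stamp is not "mxtq" (or the field is absent entirely).
func isMislabeledJANGTQ(sidecarPresent: Bool, weightFormat: String?) -> Bool {
    sidecarPresent && weightFormat != "mxtq"
}

func validateJANGTQSidecarIfRequired(bundleDir: URL, weightFormat: String?) throws {
    let sidecar = bundleDir.appendingPathComponent("jangtq_runtime.safetensors")
    let present = FileManager.default.fileExists(atPath: sidecar.path)
    if isMislabeledJANGTQ(sidecarPresent: present, weightFormat: weightFormat) {
        // Fail fast with remediation instead of letting vmlx consume 60+
        // shards before hitting "Unhandled keys" in the base loader.
        throw PreflightError(message:
            "jangtq_runtime.safetensors present but weight_format is " +
            "\(weightFormat ?? "<absent>"); expected \"mxtq\". " +
            "Fix jang_config.json or remove the sidecar.")
    }
    // Genuine bf16 dense bundles ship no sidecar and pass through.
}
```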

* deps: pin vmlx-swift-lm by revision instead of branch (per tpae's new convention)

Switch from `branch: "main"` to `revision: "c992df9..."` in
Packages/OsaurusCore/Package.swift, and strip the now-redundant
`branch: "main"` field from osaurus.xcworkspace's Package.resolved
`state` block. SwiftPM keeps the revision-only state when the source
declaration uses revision pinning; collaborators get the matching
lock file by re-resolving the workspace (Xcode 'File ▸ Packages ▸
Reset Package Caches' or `xcodebuild -resolvePackageDependencies`).

Why: prevents unwanted upgrades from the upstream main branch moving,
and makes local development against vmlx hashes deterministic across
contributors. Branch-based pinning silently picked up new commits
on every package refresh.

Build verified: `xcodebuild build` against the workspace succeeds at
this pin.
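The manifest change has this general shape (a fragment, not a full Package.swift; the package URL is a placeholder and the revision is truncated exactly as in the commit message):

```swift
// Packages/OsaurusCore/Package.swift (excerpt)
dependencies: [
    // Before — silently tracked upstream main on every package refresh:
    // .package(url: "https://github.com/.../vmlx-swift-lm.git", branch: "main"),

    // After — deterministic across contributors:
    .package(url: "https://github.com/.../vmlx-swift-lm.git",
             revision: "c992df9..."),  // full 40-char SHA in the real manifest
]
```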

* feat(reasoning): surface vmlx unclosedReasoning to chat UI as a trapped-thinking warning

Wires vmlx-swift-lm c992df9's GenerateCompletionInfo.unclosedReasoning
flag through the osaurus runtime + chat UI. Reasoning-trained Qwen3.6-A3B
and DSV4 fine-tunes can get trapped doing meta-cognitive self-validation
loops on prompts like 'give me a 20-digit number' — the model emits
[Final Check] / Self-Correction tokens forever and never closes
</think>, so the visible content channel stays empty while the answer
sits buried in the reasoning pane.

vmlx now detects this case (insideReasoning at stream end) and surfaces
it on Generation.info(GenerateCompletionInfo.unclosedReasoning). This
commit threads that signal end-to-end:

  vmlx GenerateCompletionInfo.unclosedReasoning
    ↓
  GenerationEventMapper passthrough (logs it on the [perf] mlxStats line)
    ↓
  ModelRuntimeEvent.completionInfo(..., unclosedReasoning: Bool)
    ↓
  ModelRuntime.streamWithTools encodes via StreamingStatsHint
    (wire format: "<sentinel>stats:<count>;<tps>[;unclosed]" — third
    field is comma-separated flags so future signals stay backward-compat)
    ↓
  ChatView decodes + persists onto ChatTurn.unclosedReasoning
    ↓
  ContentBlock.generationStats carries the flag to the cell renderer
    ↓
  NativeStatsView appends a one-line warning chip beside the tok/s
    counter:  '⚠ thinking didn't close — try Disable Thinking'

The HTTP API path intentionally drops the stats sentinel today (existing
behavior — clients only see OpenAI/Ollama-shaped JSON), so this is a
chat-UI-only surface for now.
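The StreamingStatsHint wire format above can be sketched as follows. The sentinel value and the type's shape are assumptions; only the field layout ("stats:<count>;<tps>" plus a comma-separated flags field) comes from the commit message.

```swift
struct StatsHint: Equatable {
    var tokenCount: Int
    var tokensPerSecond: Double
    var unclosedReasoning: Bool

    // Placeholder sentinel; the real one is internal to osaurus.
    static let sentinel = "\u{1B}"

    func encode() -> String {
        var s = "\(Self.sentinel)stats:\(tokenCount);\(tokensPerSecond)"
        if unclosedReasoning { s += ";unclosed" }  // flags field
        return s
    }

    static func decode(_ s: String) -> StatsHint? {
        guard s.hasPrefix("\(sentinel)stats:") else { return nil }
        let fields = s.dropFirst(sentinel.count + "stats:".count)
            .split(separator: ";", omittingEmptySubsequences: false)
        guard fields.count >= 2,
              let count = Int(fields[0]),
              let tps = Double(fields[1]) else { return nil }
        // Third field is comma-separated flags; unknown flags are ignored
        // so future signals stay backward-compatible, and a legacy 2-field
        // payload still decodes.
        let flags: Set<String> = fields.count >= 3
            ? Set(fields[2].split(separator: ",").map(String.init)) : []
        return StatsHint(tokenCount: count, tokensPerSecond: tps,
                         unclosedReasoning: flags.contains("unclosed"))
    }
}
```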

Live verified end-to-end on DSV4-Flash JANGTQ at HEAD with the exact
repro prompt from the upstream c992df9 commit:
  prompt='Give me a random 20-digit number. Only return the number itself.'
  → log line: 'genTokens=256 stop=length unclosedReasoning=true'

Tests:
  - 7 new StreamingStatsHint cases (encode/decode roundtrip with the
    new flag; legacy 2-field payload still decodes; unknown future
    flags are ignored without breaking parse).
  - 1 new GenerationEventMapper case (vmlx-set unclosedReasoning
    propagates through the bridge).
  - Full suite: 1027/1027 pass.

Pin: bumped to vmlx-swift-lm c992df9 (the commit that adds the flag).

Pairs with — but does NOT need — QwenThinkingProfile already defaulting
disableThinking=true for Qwen3.x; this fallback exists for the case
where the user explicitly opts INTO thinking and the model's training
nonetheless traps.

* fix(deps): bump vmlx-swift-lm pin to a7db6e5 (closes Metal-assertion crash on cache hit)

Closes the deterministic `notifyExternalReferencesNonZeroOnDealloc`
crash class that fired in `BatchEngine.stepPrefill` immediately after
`Cache disk hit` on the second request to the same model. Two upstream
commits land that fix together:

  - 98289d9 forces eager MLX async-eval on `slot.cache` after disk
    restore so lazy-restore graph ops don't extend into the next
    request's command buffer.
  - a7db6e5 sets `continuation.onTermination` on
    `BatchEngine.generate`'s outStream so orphan slots get reaped via
    `engine.cancel(requestId)` whenever a consumer breaks early.

Net effect: the `Task.isCancelled` break in MLXBatchAdapter lines
209-222 is safe, no `enableDiskCache: false` workaround is needed,
and the cache disk-hit fast path is now usable on back-to-back requests.
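The onTermination fix from a7db6e5 has this general shape. Engine and its slot bookkeeping are illustrative stand-ins; only the continuation.onTermination → engine.cancel(requestId) pattern comes from the commit message.

```swift
import Foundation

// Illustrative stand-in for the batch engine's slot bookkeeping.
final class Engine: @unchecked Sendable {
    private let lock = NSLock()
    private var active: Set<String> = []
    func start(_ id: String)  { lock.lock(); active.insert(id); lock.unlock() }
    func cancel(_ id: String) { lock.lock(); active.remove(id); lock.unlock() }
    var activeCount: Int { lock.lock(); defer { lock.unlock() }; return active.count }
}

func generate(engine: Engine, requestId: String) -> AsyncStream<String> {
    AsyncStream { continuation in
        engine.start(requestId)
        // Without this handler, a consumer that breaks out of `for await`
        // leaves an orphan slot whose pending work can outlive the request;
        // onTermination fires on finish, cancellation, or early break.
        continuation.onTermination = { _ in
            engine.cancel(requestId)
        }
        continuation.yield("token")
        continuation.finish()
    }
}
```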

Live verification (this commit, exhaustive 5-phase matrix):

  Phase 1 (exact bug report repro): qwen3.6-27b-mxfp4-crack, 2x
    identical prompts, request B finished cleanly in 6s (vs 10s cold A).
  Phase 2 (sustained loop): 5x back-to-back identical, app pid steady,
    RSS stable.
  Phase 3 (orphan-slot pattern): 3x cycles of cancel-mid-stream +
    immediate retry, no crash on any cycle.
  Phase 4 (hybrid SSM): holo3-35b-a3b-jangtq4, prompt-tps jumped
    75→482 (6.3x) on cache hit, no crash.
  Phase 5 (other model class): deepseekv4-flash-jangtq A+B identical,
    both completed cleanly.

Across 15+ generations spanning 3 model_type families
(qwen3_5_moe / qwen3_5_moe+SSM / deepseek_v4) and ~3.5 minutes of app
uptime: ZERO Metal assertions, zero process exits, zero
`destroyed while still required` lines in stderr or syslog.

Also includes c992df9's `GenerateCompletionInfo.unclosedReasoning` for
the trapped-thinking chip — same pin lineage.

* ci: revert release-drafter v7 -> v6 (target_commitish API rejection)

