
fix flaky tests and update headers for SQLite collision prevention #956

Merged
tpae merged 5 commits into main from chore/fix-flaky-tests on Apr 27, 2026
Conversation

@tpae (Contributor) commented Apr 27, 2026

Summary

Fixed flaky tests.

Checklist

  • I have read CONTRIBUTING.md
  • I added/updated tests where reasonable
  • I updated docs/README as needed
  • I verified build on macOS with Xcode 16.4+

@mimeding (Contributor)
CI update from checking this against #954/#955: #956 still fails test-core.

Observed failures in run https://github.com/osaurus-ai/osaurus/actions/runs/24996830968/job/73197126271:

  • SandboxManagerCleanupTests.cleanupAfterFailure_removesStaleContainerDirectory
  • PluginCreatorInjectionTests.composeChatContext_injectsPluginCreatorWhenCatalogEmpty

The direct pluginCreatorSkillSection_returnsContentWhenSkillEnabled case passed in this run, so the remaining plugin failure looks specific to the composeChatContext gate around ToolRegistry.shared.dynamicCatalogIsEmpty() and process-wide registry state, rather than to the built-in skill being unavailable.

Two concrete follow-ups that look worth checking:

  • The sandbox cleanup tests now use SandboxTestLock, but cleanupAfterFailure_removesStaleContainerDirectory still uses OsaurusPaths.container() through staleContainerDir. It may also need StoragePathsTestLock or a temp OsaurusPaths.overrideRoot so concurrent storage/path tests cannot move the cleanup target while the test is suspended.
  • The dynamic-catalog lock should cover every test that registers/enables/unregisters dynamic tools or redirects ToolConfigurationStore.overrideDirectory. In the current patch, the MCP tests set ToolConfigurationStore.overrideDirectory = tempDir inside DynamicCatalogTestLock, but I don’t see a restore of the previous override or temp cleanup; that global leak can affect later prompt-composition tests even when the tool itself is unregistered.
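A minimal sketch of the restore-and-cleanup shape the second follow-up suggests. The ToolConfigurationStore here is a simplified stand-in, and withTemporaryToolConfigDirectory is a hypothetical helper; only the overrideDirectory global and the temp-dir pattern come from the discussion above.

```swift
import Foundation

// Simplified stand-in for the real store's global override.
enum ToolConfigurationStore {
    static var overrideDirectory: URL?
}

// Hypothetical helper: saves the previous override, points the store at a
// fresh temp directory for the duration of `body`, then restores the prior
// value and removes the temp directory — so the global never leaks into
// later prompt-composition tests.
func withTemporaryToolConfigDirectory<T>(_ body: (URL) throws -> T) rethrows -> T {
    let previous = ToolConfigurationStore.overrideDirectory
    let tempDir = FileManager.default.temporaryDirectory
        .appendingPathComponent(UUID().uuidString)
    try? FileManager.default.createDirectory(at: tempDir,
                                             withIntermediateDirectories: true)
    defer {
        ToolConfigurationStore.overrideDirectory = previous  // restore, not just nil
        try? FileManager.default.removeItem(at: tempDir)     // temp cleanup
    }
    ToolConfigurationStore.overrideDirectory = tempDir
    return try body(tempDir)
}
```

In the real suite this would run inside DynamicCatalogTestLock, so the save/restore pair is serialized with every other test that touches the override.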

The run used Xcode 26.3. That was not the immediate failure here because the package built and tests launched, but #954/#955 already hit an unrelated EventSource transitive-module build failure on 26.3 before I aligned their CI pins to 26.4.1. After the flake fixes are green, it may still be worth aligning this branch/main CI with the release workflow’s Xcode pin.

tpae merged commit 3084c50 into main on Apr 27, 2026
8 of 9 checks passed
tpae deleted the chore/fix-flaky-tests branch on April 27, 2026 at 16:54
jjang-ai added a commit that referenced this pull request Apr 28, 2026
PR-event runs of release-drafter@v7 fail with:
  Validation Failed: {"resource":"Release","code":"invalid",
                     "field":"target_commitish"}

Triggered by v7's `Updating existing release...` step which sends
`target_commitish: refs/pull/<N>/merge` — a value GitHub's REST
`/releases` API rejects (must be a real branch). v6 omits the field
on PR runs and only updates the draft body, which is what we want for
the PR-time preview.

Empirical evidence:
  PR #944 / #951 / #956 — passed on v6 (the prior pin).
  PR #953 — failed on v7 (introduced by commit 56c4dc7 "fix release
            drafter" on main 2026-04-27, which mistakenly upgraded the
            action while attempting to fix something else).

Revert to v6 restores green CI. The push-to-main run keeps working on
either version because that path provides a real branch ref.
jjang-ai added a commit that referenced this pull request Apr 28, 2026
…correct (#953)

* fix(preflight): detect mislabeled JANGTQ bundles (sidecar present + non-mxtq stamp)

Bundles that ship jangtq_runtime.safetensors but stamp jang_config.json
as weight_format: "bf16" (or omit the field entirely) cause vmlx to
dispatch to the BASE DeepseekV4Model / MiniMaxModel / etc. class. The
safetensors then carry tq_norms / tq_packed tensors the base parameter
loader can't decode, surfacing as:

  Error: Unhandled keys ["tq_norms", "tq_packed"] in
  model.layers.0.mlp.switch_mlp.up_proj in
  DeepseekV4Model...DeepseekV4MoE.SwitchGLU.SwitchLinear

Live-confirmed 2026-04-25 on an early DSV4-Flash JANGTQ bundle stamped
"bf16" — the case the vmlx integration doc explicitly notes via the
DSV4_FORCE_JANGTQ=1 env-var workaround.

This commit extends validateJANGTQSidecarIfRequired with the inverse
mismatch check: sidecar present + weight_format != "mxtq" → throw a
clear remediation error instead of letting vmlx consume 60+ shards
before failing on Unhandled keys.

Genuine bf16 dense bundles (DSV4-Flash-JANG_2L, etc.) ship NO sidecar
and continue to pass through. Three new tests cover both mislabel
shapes (bf16+sidecar, absent+sidecar) plus a regression test confirming
the bf16-no-sidecar happy path still works.
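The mismatch check described above can be sketched as follows. isMislabeledJANGTQ and PreflightError are hypothetical names; only validateJANGTQSidecarIfRequired, the file names, and the "mxtq" stamp come from the commit message.

```swift
import Foundation

struct PreflightError: Error { let message: String }

// Pure predicate for the inverse mismatch: sidecar present but the config
// stamp is not "mxtq" (or the field is absent entirely).
func isMislabeledJANGTQ(sidecarPresent: Bool, weightFormat: String?) -> Bool {
    sidecarPresent && weightFormat != "mxtq"
}

func validateJANGTQSidecarIfRequired(bundleDir: URL, weightFormat: String?) throws {
    let sidecar = bundleDir.appendingPathComponent("jangtq_runtime.safetensors")
    let present = FileManager.default.fileExists(atPath: sidecar.path)
    if isMislabeledJANGTQ(sidecarPresent: present, weightFormat: weightFormat) {
        // Fail fast with remediation instead of letting vmlx consume 60+
        // shards before hitting "Unhandled keys" in the base loader.
        throw PreflightError(message:
            "jangtq_runtime.safetensors present but weight_format is " +
            "\(weightFormat ?? "<absent>"); expected \"mxtq\". " +
            "Fix jang_config.json or remove the sidecar.")
    }
    // Genuine bf16 dense bundles ship no sidecar and pass through.
}
```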

* deps: pin vmlx-swift-lm by revision instead of branch (per tpae's new convention)

Switch from `branch: "main"` to `revision: "c992df9..."` in
Packages/OsaurusCore/Package.swift, and strip the now-redundant
`branch: "main"` field from osaurus.xcworkspace's Package.resolved
`state` block. SwiftPM keeps the revision-only state when the source
declaration uses revision pinning; collaborators get the matching
lock file by re-resolving the workspace (Xcode 'File ▸ Packages ▸
Reset Package Caches' or `xcodebuild -resolvePackageDependencies`).

Why: prevents unwanted upgrades from the upstream main branch moving,
and makes local development against vmlx hashes deterministic across
contributors. Branch-based pinning silently picked up new commits
on every package refresh.

Build verified: `xcodebuild build` against the workspace succeeds at
this pin.
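The manifest change has this general shape (a fragment, not a full Package.swift; the package URL is a placeholder and the revision is truncated exactly as in the commit message):

```swift
// Packages/OsaurusCore/Package.swift (excerpt)
dependencies: [
    // Before — silently tracked upstream main on every package refresh:
    // .package(url: "https://github.com/.../vmlx-swift-lm.git", branch: "main"),

    // After — deterministic across contributors:
    .package(url: "https://github.com/.../vmlx-swift-lm.git",
             revision: "c992df9..."),  // full 40-char SHA in the real manifest
]
```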

* feat(reasoning): surface vmlx unclosedReasoning to chat UI as a trapped-thinking warning

Wires vmlx-swift-lm c992df9's GenerateCompletionInfo.unclosedReasoning
flag through the osaurus runtime + chat UI. Reasoning-trained Qwen3.6-A3B
and DSV4 fine-tunes can get trapped doing meta-cognitive self-validation
loops on prompts like 'give me a 20-digit number' — the model emits
[Final Check] / Self-Correction tokens forever and never closes
</think>, so the visible content channel stays empty while the answer
sits buried in the reasoning pane.

vmlx now detects this case (insideReasoning at stream end) and surfaces
it on Generation.info(GenerateCompletionInfo.unclosedReasoning). This
commit threads that signal end-to-end:

  vmlx GenerateCompletionInfo.unclosedReasoning
    ↓
  GenerationEventMapper passthrough (logs it on the [perf] mlxStats line)
    ↓
  ModelRuntimeEvent.completionInfo(..., unclosedReasoning: Bool)
    ↓
  ModelRuntime.streamWithTools encodes via StreamingStatsHint
    (wire format: "<sentinel>stats:<count>;<tps>[;unclosed]" — third
    field is comma-separated flags so future signals stay backward-compat)
    ↓
  ChatView decodes + persists onto ChatTurn.unclosedReasoning
    ↓
  ContentBlock.generationStats carries the flag to the cell renderer
    ↓
  NativeStatsView appends a one-line warning chip beside the tok/s
    counter:  '⚠ thinking didn't close — try Disable Thinking'

The HTTP API path intentionally drops the stats sentinel today (existing
behavior — clients only see OpenAI/Ollama-shaped JSON), so this is a
chat-UI-only surface for now.
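The StreamingStatsHint wire format above can be sketched as follows. The sentinel value and the type's shape are assumptions; only the field layout ("stats:<count>;<tps>" plus a comma-separated flags field) comes from the commit message.

```swift
struct StatsHint: Equatable {
    var tokenCount: Int
    var tokensPerSecond: Double
    var unclosedReasoning: Bool

    // Placeholder sentinel; the real one is internal to osaurus.
    static let sentinel = "\u{1B}"

    func encode() -> String {
        var s = "\(Self.sentinel)stats:\(tokenCount);\(tokensPerSecond)"
        if unclosedReasoning { s += ";unclosed" }  // flags field
        return s
    }

    static func decode(_ s: String) -> StatsHint? {
        guard s.hasPrefix("\(sentinel)stats:") else { return nil }
        let fields = s.dropFirst(sentinel.count + "stats:".count)
            .split(separator: ";", omittingEmptySubsequences: false)
        guard fields.count >= 2,
              let count = Int(fields[0]),
              let tps = Double(fields[1]) else { return nil }
        // Third field is comma-separated flags; unknown flags are ignored
        // so future signals stay backward-compatible, and a legacy 2-field
        // payload still decodes.
        let flags: Set<String> = fields.count >= 3
            ? Set(fields[2].split(separator: ",").map(String.init)) : []
        return StatsHint(tokenCount: count, tokensPerSecond: tps,
                         unclosedReasoning: flags.contains("unclosed"))
    }
}
```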

Live verified end-to-end on DSV4-Flash JANGTQ at HEAD with the exact
repro prompt from the upstream c992df9 commit:
  prompt='Give me a random 20-digit number. Only return the number itself.'
  → log line: 'genTokens=256 stop=length unclosedReasoning=true'

Tests:
  - 7 new StreamingStatsHint cases (encode/decode roundtrip with the
    new flag; legacy 2-field payload still decodes; unknown future
    flags are ignored without breaking parse).
  - 1 new GenerationEventMapper case (vmlx-set unclosedReasoning
    propagates through the bridge).
  - Full suite: 1027/1027 pass.

Pin: bumped to vmlx-swift-lm c992df9 (the commit that adds the flag).

Pairs with — but does NOT need — QwenThinkingProfile already defaulting
disableThinking=true for Qwen3.x; this fallback exists for the case
where the user explicitly opts INTO thinking and the model's training
nonetheless traps.

* fix(deps): bump vmlx-swift-lm pin to a7db6e5 (closes Metal-assertion crash on cache hit)

Closes the deterministic `notifyExternalReferencesNonZeroOnDealloc`
crash class that fired in `BatchEngine.stepPrefill` immediately after
`Cache disk hit` on the second request to the same model. Two upstream
commits land that fix together:

  - 98289d9 forces eager MLX async-eval on `slot.cache` after disk
    restore so lazy-restore graph ops don't extend into the next
    request's command buffer.
  - a7db6e5 sets `continuation.onTermination` on
    `BatchEngine.generate`'s outStream so orphan slots get reaped via
    `engine.cancel(requestId)` whenever a consumer breaks early.

Net effect: the `Task.isCancelled` break in MLXBatchAdapter lines
209-222 is safe, no `enableDiskCache: false` workaround is needed,
and the cache disk-hit fast path is now usable on back-to-back requests.
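The onTermination fix from a7db6e5 has this general shape. Engine and its slot bookkeeping are illustrative stand-ins; only the continuation.onTermination → engine.cancel(requestId) pattern comes from the commit message.

```swift
import Foundation

// Illustrative stand-in for the batch engine's slot bookkeeping.
final class Engine: @unchecked Sendable {
    private let lock = NSLock()
    private var active: Set<String> = []
    func start(_ id: String)  { lock.lock(); active.insert(id); lock.unlock() }
    func cancel(_ id: String) { lock.lock(); active.remove(id); lock.unlock() }
    var activeCount: Int { lock.lock(); defer { lock.unlock() }; return active.count }
}

func generate(engine: Engine, requestId: String) -> AsyncStream<String> {
    AsyncStream { continuation in
        engine.start(requestId)
        // Without this handler, a consumer that breaks out of `for await`
        // leaves an orphan slot whose pending work can outlive the request;
        // onTermination fires on finish, cancellation, or early break.
        continuation.onTermination = { _ in
            engine.cancel(requestId)
        }
        continuation.yield("token")
        continuation.finish()
    }
}
```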

Live verification (this commit, exhaustive 5-phase matrix):

  Phase 1 (exact bug report repro): qwen3.6-27b-mxfp4-crack, 2x
    identical prompts, request B finished cleanly in 6s (vs 10s cold A).
  Phase 2 (sustained loop): 5x back-to-back identical, app pid steady,
    RSS stable.
  Phase 3 (orphan-slot pattern): 3x cycles of cancel-mid-stream +
    immediate retry, no crash on any cycle.
  Phase 4 (hybrid SSM): holo3-35b-a3b-jangtq4, prompt-tps jumped
    75→482 (6.3x) on cache hit, no crash.
  Phase 5 (other model class): deepseekv4-flash-jangtq A+B identical,
    both completed cleanly.

Across 15+ generations spanning 3 model_type families
(qwen3_5_moe / qwen3_5_moe+SSM / deepseek_v4) and ~3.5 minutes of app
uptime: ZERO Metal assertions, zero process exits, zero
`destroyed while still required` lines in stderr or syslog.

Also includes c992df9's `GenerateCompletionInfo.unclosedReasoning` for
the trapped-thinking chip — same pin lineage.

* ci: revert release-drafter v7 -> v6 (target_commitish API rejection)

