Skip to content

Refactor: move fanout wiring to orchestrator, drop wiring queue#1264

Open
ChaoWao wants to merge 1 commit into
hw-native-sys:mainfrom
ChaoWao:orch-side-wiring
Open

Refactor: move fanout wiring to orchestrator, drop wiring queue#1264
ChaoWao wants to merge 1 commit into
hw-native-sys:mainfrom
ChaoWao:orch-side-wiring

Conversation

@ChaoWao

@ChaoWao ChaoWao commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

Moves all task fanout wiring from the scheduler's deferred wiring queue onto the orchestrator, and deletes the wiring-queue subsystem entirely.

  • The orchestrator now wires each task inline at submit time: dep_pool.ensure_space (block / reclaim / overflow-detect) → wire_task. This mirrors the fanin_pool.ensure_space already on the submit path.
  • dep_pool becomes orchestrator-exclusive (single writer) → no lock needed. The dep_pool_lock added in optimize: prewire task dependencies on orchestrator side #1263 was only required by the hybrid O+S design; a single-wirer model is lock-free. The only cross-thread serialization remains the per-producer fanout_lock, shared with the scheduler's completion handler.
  • Deleted: PTO2SpscQueue, WiringState, drain_wiring_queue, report_wiring_deadlock, producer_blocked, orch_needs_drain, dep_deadlock_reported, and the Wire swimlane phase.
  • Forward-progress safety moves from the S-side structural wiring-deadlock detector to PTO2DepListPool::ensure_space's existing no-progress overflow detector (same PTO2_ERROR_DEP_POOL_OVERFLOW fatal).

Why

A single wirer gives a simpler model — lock-free dep_pool, no SPSC queue, no batch/backoff machinery. The benchmark below tests whether the wiring queue's batching was still pulling its weight.

Measured impact (a2a3 onboard, device 2, 100 rounds, trimmed mean; vs mainline c4cfc968)

Example Effective Δ Device Δ Orch Δ
qwen3_14b_decode (StressBatch16Seq3500) −0.1% −0.4% +11.2%
paged_attention_unroll (Case1 / Case2) −1.1% / −0.5% ~flat +14.7% / +2.8%
paged_attention_unroll_manual_scope (Case1 / Case2) −0.8% / +0.8% ~flat +11.4% / +6.2%
alternating_matmul_add (Case1) +4.6% ⚠️ +2.7% +4.7%
benchmark_bgemm (Case0) +9.4% ⚠️⚠️ +5.6% +13.8%
batch_paged_attention (Case1) +5.6% ⚠️ +4.4% +9.2%

Neutral on S-bound workloads (incl. qwen3); regresses O-bound ones (where Orch ≈ Effective) by 4.6–9.4%. Orch rises everywhere because the wiring work shifts S→O; where the orchestrator was already the bottleneck, that increase lands directly on Effective. The wiring queue's batching was load-bearing for those cases.

This is opened to get CI/review input on the full-removal approach versus the hybrid in #1263 (keep the queue + add O-side fast paths), which avoids these regressions. It is not a clear net win as-is.

Testing

  • a2a3sim tensormap_and_ringbuffer suite — 30 passed, 1 skipped.
  • Builds clean (a2a3 sim + onboard runtimes); clang-format + markdownlint pass.
  • cpput (hardware-gated) — pending CI.
  • Onboard benchmark vs mainline (table above); test_spsc_queue.cpp (a2a3) removed, drain cases dropped from test_wiring.cpp (wire_task cases retained).

Known caveat

PTO2_SERIAL_ORCH_SCHED=1 (default-off, debug-only) now requires a graph's in-flight fanout to fit in dep_pool, since no task completes during orchestration to reclaim it. Documented in RUNTIME_LOGIC.md.

…ueue

The orchestrator now wires each task's fanout edges inline at submit time
(dep_pool.ensure_space + wire_task) instead of pushing to a deferred wiring
queue drained by scheduler thread 0. The whole wiring-queue subsystem is
deleted: PTO2SpscQueue, WiringState, drain_wiring_queue, report_wiring_deadlock,
producer_blocked, orch_needs_drain, dep_deadlock_reported, and the Wire
swimlane phase.

- dep_pool becomes orchestrator-exclusive (single writer), so it needs no lock;
  the only cross-thread serialization is the existing per-producer fanout_lock,
  shared with the scheduler's completion handler.
- Forward-progress safety moves from the S-side structural wiring-deadlock
  detector to PTO2DepListPool::ensure_space's existing no-progress overflow
  detector (same PTO2_ERROR_DEP_POOL_OVERFLOW fatal), mirroring the
  fanin_pool.ensure_space already used on the submit path.
- wait_for_tensor_ready's orch_needs_drain signal is removed: with O wiring
  every task at submit time, the producer it waits on is already wired, and the
  scheduler dispatch loop is a busy poll that needs no "drain now" nudge.

Docs (RUNTIME_LOGIC.md, MULTI_RING.md, SUBMIT_BY_CLUSTER.md) updated to the
O-side wiring model. test_spsc_queue.cpp (a2a3) removed; the drain_wiring_queue
cases dropped from test_wiring.cpp (wire_task cases retained).

Testing: a2a3sim tensormap_and_ringbuffer suite — 30 passed, 1 skipped.
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR removes the scheduler's SPSC-based wiring queue subsystem and moves fanout wiring inline into the orchestrator's task submission path via a direct wire_task call. Associated profiling counters, the "wire" scheduler phase, tests, and documentation are updated to match the new inline model.

Changes

Inline Fanout Wiring Replaces Scheduler Wiring Queue

Layer / File(s) Summary
Scheduler/type contracts
.../pto_runtime2_types.h, .../scheduler/pto_scheduler.h
Removes PTO2_WRIRING_QUEUE_SIZE, the inline PTO2SpscQueue type, wiring-related layout fields, and WiringState; adds push_ready_routed helper and updates dep_pool ownership comments.
Orchestrator inline wiring
.../runtime/pto_orchestrator.cpp
submit_task_common ensures dep_pool capacity and calls sched->wire_task directly instead of pushing to a wiring queue with spin-wait/deadlock detection.
Runtime lifecycle updates
.../shared/pto_runtime2_init.cpp, .../runtime/pto_runtime2.{h,cpp}
Init/reset/destroy and arena-wiring paths no longer manage wiring.queue; wait_for_tensor_ready drops orch_needs_drain signaling.
Dispatch/cold-path cleanup
.../scheduler/scheduler_dispatch.cpp, .../scheduler/scheduler_cold_path.cpp, .../scheduler/scheduler_context.h
Removes the wiring-queue drain dispatch phase, its profiling emission, and related shutdown/wait-loop logic.
Profiling/trace phase removal
.../l2_swimlane_profiling.h, .../l2_swimlane_collector.cpp, .../scheduler/scheduler_types.h, simpler_setup/tools/swimlane_converter.py
Removes the Wire phase kind, its string mapping, and wiring cycle counters; adds a dummy phase color in the trace converter.
Test removal
tests/ut/cpp/a2a3/test_wiring.cpp, tests/ut/cpp/a2a3/test_spsc_queue.cpp, tests/ut/cpp/CMakeLists.txt
Removes drain_wiring_queue tests, the entire SPSC queue test file, and its CTest registration.
Documentation updates
.../docs/MULTI_RING.md, .../docs/RUNTIME_LOGIC.md, .../docs/SUBMIT_BY_CLUSTER.md
Describes orchestrator-inline fanout wiring and dep_pool reclamation replacing the global wiring queue.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Orchestrator as PTO2OrchestratorState
  participant DepPool as dep_pool
  participant Scheduler as sched (wire_task)
  participant ReadyQueue as push_ready_routed

  Orchestrator->>DepPool: ensure_space(fanin_actual_count)
  DepPool-->>Orchestrator: PTO2_ERROR_DEP_POOL_OVERFLOW (fatal on failure)
  Orchestrator->>Scheduler: wire_task(rss, slot_state, wfanin)
  Scheduler->>ReadyQueue: push_ready_routed(slot) when ready
Loading

Possibly related PRs

  • hw-native-sys/simpler#942: Both modify L2SwimlaneSchedPhaseKind and its host JSON formatter around the Wire phase kind.
  • hw-native-sys/simpler#1108: Both change L2 swimlane scheduler phase handling for the Wire discriminator across enum and emission logic.
  • hw-native-sys/simpler#1141: Both touch the same wiring-queue/dep-pool deadlock reporting code paths in pto_scheduler.h, in opposite directions.

Poem

A rabbit hopped through queues so tight,
Untangled wires, set fanout right,
No more spinning, no more wait,
Straight to wire_task, sealing fate,
Hop, skip, submit — the graph's in flight! 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main refactor: moving fanout wiring to the orchestrator and removing the wiring queue.
Description check ✅ Passed The description matches the changeset and accurately explains the orchestrator wiring refactor and queue removal.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the fanout wiring mechanism by removing the deferred SPSC wiring queue and performing fanout wiring inline on the orchestrator side at submit time. This change simplifies the scheduler by removing the wire phase, the PTO2SpscQueue structure, and its associated tests. Feedback on the changes highlights two high-severity issues where signed integer parameters (wfanin and dep_pool_capacities[r]) are used in memory operations or cast to size_t without validating that they are non-negative, which could lead to out-of-bounds writes or signed-to-unsigned conversion overflows.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +879 to +884
int32_t wfanin = payload.fanin_actual_count;
if (wfanin > 0 && !rss.dep_pool.ensure_space(*rss.ring, wfanin)) {
orch_mark_fatal(orch, PTO2_ERROR_DEP_POOL_OVERFLOW);
return result;
}
sched->wiring.producer_blocked.store(0, std::memory_order_release);
sched->wire_task(rss, &cur_slot_state, wfanin);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The signed 32-bit integer wfanin (derived from payload.fanin_actual_count) is used as a count parameter for ensure_space and wire_task without validation that it is non-negative. If wfanin is negative, the wfanin > 0 check is bypassed, but wire_task is still called with a negative value. Inside wire_task, wfanin != 0 is true, which will attempt to allocate entries from dep_pool via dep_pool.prepend without having reserved space, potentially causing dependency pool corruption or out-of-bounds writes.

Per defensive programming guidelines, always validate that signed count parameters are non-negative before using them in bounds checks or memory operations.

        int32_t wfanin = payload.fanin_actual_count;
        always_assert(wfanin >= 0);
        if (wfanin > 0 && !rss.dep_pool.ensure_space(*rss.ring, wfanin)) {
            orch_mark_fatal(orch, PTO2_ERROR_DEP_POOL_OVERFLOW);
            return result;
        }
        sched->wire_task(rss, &cur_slot_state, wfanin);
References
  1. When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations (such as memcpy or array indexing), always validate that the count is non-negative (e.g., count >= 0) to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.

Comment on lines 147 to 148
layout.off_dep_pool_entries[r] =
arena.reserve(static_cast<size_t>(dep_pool_capacities[r]) * sizeof(PTO2DepListEntry), PTO2_ALIGN_SIZE);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The signed 32-bit integer dep_pool_capacities[r] is cast to size_t and used as a size parameter in arena.reserve without validating that it is non-negative. If dep_pool_capacities[r] is negative, casting it to size_t will cause a signed-to-unsigned conversion overflow, resulting in an extremely large allocation request or integer overflow in the multiplication, potentially bypassing bounds checks and causing heap corruption or out-of-bounds writes.

Per defensive programming guidelines, always validate that signed count/size parameters are non-negative before using them in memory operations.

        always_assert(dep_pool_capacities[r] >= 0);
        layout.off_dep_pool_entries[r] =
            arena.reserve(static_cast<size_t>(dep_pool_capacities[r]) * sizeof(PTO2DepListEntry), PTO2_ALIGN_SIZE);
References
  1. When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations (such as memcpy or array indexing), always validate that the count is non-negative (e.g., count >= 0) to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp (1)

464-471: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Stale comment: "fanin_count is set by scheduler during wiring".

Wiring is now performed inline by the orchestrator (per the updated comments immediately above at Lines 448-455 and the Step 6 rewrite at Lines 868-884), not deferred to a scheduler thread. This leftover comment on Line 470 contradicts the new model and should be updated for consistency.

📝 Proposed fix
-    // fanin_count is set by scheduler during wiring
+    // fanin_count is set by wire_task during inline orchestrator-side wiring
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`
around lines 464 - 471, The comment on the slot setup in pto_orchestrator.cpp is
stale and contradicts the current inline wiring flow. Update the note near
out->slot_state and scope_tasks_push in the orchestrator path to reflect that
wiring now happens inside the orchestrator (consistent with the comments around
the earlier setup and the Step 6 rewrite), and remove any mention that
fanin_count is set by a scheduler during wiring.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h (1)

1057-1061: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Stale comment: "wiring SPSC buffer" no longer exists.

reserve_layout's Phase 1 doc still lists "wiring SPSC buffer" as one of the sub-regions declared, but this PR removes that subsystem entirely (per PTO2SchedulerLayout at Line 404-411). This leftover reference will mislead future readers about what reserve_layout actually reserves.

📝 Proposed fix
-    // Phase 1: declare every sub-region (ready_queue slots, dummy queue slots,
-    // per-ring dep_pool entries, wiring SPSC buffer) on the supplied arena.
+    // Phase 1: declare every sub-region (ready_queue slots, dummy queue slots,
+    // per-ring dep_pool entries) on the supplied arena.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h`
around lines 1057 - 1061, The Phase 1 comment for reserve_layout is stale
because it still mentions the removed wiring SPSC buffer even though
PTO2SchedulerLayout no longer includes that subsystem. Update the doc comment
near reserve_layout to describe only the regions that are actually reserved now
(for example ready_queue slots, dummy queue slots, and per-ring dep_pool
entries) and remove any reference to wiring SPSC buffer so the comment matches
the current layout.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`:
- Around line 464-471: The comment on the slot setup in pto_orchestrator.cpp is
stale and contradicts the current inline wiring flow. Update the note near
out->slot_state and scope_tasks_push in the orchestrator path to reflect that
wiring now happens inside the orchestrator (consistent with the comments around
the earlier setup and the Step 6 rewrite), and remove any mention that
fanin_count is set by a scheduler during wiring.

In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h`:
- Around line 1057-1061: The Phase 1 comment for reserve_layout is stale because
it still mentions the removed wiring SPSC buffer even though PTO2SchedulerLayout
no longer includes that subsystem. Update the doc comment near reserve_layout to
describe only the regions that are actually reserved now (for example
ready_queue slots, dummy queue slots, and per-ring dep_pool entries) and remove
any reference to wiring SPSC buffer so the comment matches the current layout.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 993e3b22-c49a-49f4-a8cd-fb7a4e217c49

📥 Commits

Reviewing files that changed from the base of the PR and between c26fe43 and 0712132.

📒 Files selected for processing (19)
  • simpler_setup/tools/swimlane_converter.py
  • src/a2a3/platform/include/common/l2_swimlane_profiling.h
  • src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
  • src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
  • src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
  • tests/ut/cpp/CMakeLists.txt
  • tests/ut/cpp/a2a3/test_spsc_queue.cpp
  • tests/ut/cpp/a2a3/test_wiring.cpp
💤 Files with no reviewable changes (9)
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
  • tests/ut/cpp/a2a3/test_spsc_queue.cpp
  • src/a2a3/platform/include/common/l2_swimlane_profiling.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
  • tests/ut/cpp/CMakeLists.txt
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
  • src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • tests/ut/cpp/a2a3/test_wiring.cpp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant