optimize: prewire task dependencies on orchestrator side by Crane-Liu · Pull Request #1263 · hw-native-sys/simpler

Crane-Liu · 2026-07-03T01:55:24Z

Summary

Move dependency wiring to the orchestrator side for the a2a3 and a5 tensormap_and_ringbuffer runtimes, and remove the old scheduler-side deferred wiring fallback.

The latest cleanup makes this PR a single Orch-side wiring design instead of a mixed O/S fallback path:

- Zero-fanin tasks are published ready directly by Orch.
- Fanin tasks whose producers are already complete are published ready directly by Orch.
- Fanin tasks with live producers are wired by Orch into the dependency pool before scheduling.
- Scheduler threads no longer drain a deferred wiring queue.
- S0/S1/S2 can still complete dependency-only/dummy fanout in parallel.
- a2a3 and a5 now follow the same converged structure.

Modified Code

a2a3 runtime

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
- Move no-fanin, completed-fanin, and live-fanin wiring/publish decisions into Orch submit.
- Remove fallback branches that sent dependency wiring back to the scheduler path.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
- Remove old SPSC wiring queue helpers and deferred wiring drain logic.
- Keep the shared fanin wiring helper used by Orch-side live-fanin wiring.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
- Remove the scheduler dispatch phase that drained deferred wiring work.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
- Align cold-path completion/fanout handling with the new Orch-wired dependency state.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
- Remove wiring queue layout/init/reset/destroy plumbing.
- Update runtime initialization around Orch-owned dependency wiring state.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.{cpp,h}
- Remove stale wiring queue APIs and state exposure.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
- Remove deprecated wiring state fields tied to the fallback queue.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
- Update ring/dependency-pool comments and ownership assumptions.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/{scheduler_context.h,scheduler_types.h}
- Remove unused scheduler-side wiring/drain fields.

a5 runtime

Applied the same convergence to the matching a5 files under src/a5/runtime/tensormap_and_ringbuffer/runtime/.
a5 now uses the same Orch-side submit/wiring model as a2a3 instead of carrying an independent fallback path.

Shared profiling / tooling

src/a2a3/platform/include/common/l2_swimlane_profiling.h
src/a5/platform/include/common/l2_swimlane_profiling.h
simpler_setup/tools/swimlane_converter.py

These were updated so the old scheduler-wire swimlane/profiling label is treated as legacy scheduler wire phase metadata rather than an active deferred wiring phase.

Documentation Updates

Updated both a2a3 and a5 runtime docs:

src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
src/a5/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md

The docs now describe:

- Orch-side dependency wiring as the primary path.
- Removal of the scheduler-side deferred wiring queue fallback.
- Updated ownership of dependency-pool writes.
- The remaining scheduler responsibility: execute ready work and release fanout.
- Parallel completion of dependency-only/dummy fanout by S0/S1/S2.

Test Updates

tests/ut/cpp/a2a3/test_wiring.cpp
tests/ut/cpp/a5/test_wiring.cpp

Updated wiring tests to exercise Orch-side helpers directly.

Removed obsolete tests for the deleted SPSC wiring queue:

tests/ut/cpp/a2a3/test_spsc_queue.cpp
tests/ut/cpp/a5/test_spsc_queue.cpp

Updated:

tests/ut/cpp/CMakeLists.txt

Current Behavior

- wfanin == 0: Orch seeds fanin state, records the dep-pool position, and routes the task ready.
- wfanin > 0 and all producers are already complete: Orch seeds fanin/dispatch state and routes the task ready.
- wfanin > 0 with unfinished producers: Orch wires fanout entries through the shared wiring helper before the task becomes visible to schedulers.
- Scheduler threads complete ready tasks and release fanout; they no longer drain a deferred wiring queue.
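The three submit-time cases above can be sketched as a small classifier. This is a minimal illustration, not the runtime's code: `Producer`, `SubmitRoute`, and `classify_submit` are hypothetical names, and the real slot state in pto_runtime2_types.h is not reproduced here.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a producer slot (assumption, not the PR's type).
struct Producer {
    std::atomic<bool> completed{false};
};

// The three outcomes listed under "Current Behavior".
enum class SubmitRoute {
    ReadyNoFanin,         // wfanin == 0
    ReadyFaninCompleted,  // wfanin > 0, every producer already complete
    WireLiveFanin         // wfanin > 0, at least one producer still running
};

// Classify a task at submit time from its producer list.
inline SubmitRoute classify_submit(const std::vector<const Producer*>& producers) {
    if (producers.empty()) {
        return SubmitRoute::ReadyNoFanin;
    }
    for (const Producer* p : producers) {
        // acquire pairs with the release store a completing producer would do
        if (!p->completed.load(std::memory_order_acquire)) {
            return SubmitRoute::WireLiveFanin;
        }
    }
    return SubmitRoute::ReadyFaninCompleted;
}
```

The sketch only shows the decision. A real implementation also has to stay race-free against a producer that completes between this check and the wiring step, which is outside what this example models.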

Results

Previous 100-round Strategy Sweep

Benchmark: qwen3_14b_decode, StressBatch16Seq3500, a2a3, device 4, tensormap_and_ringbuffer, 100 rounds per run. Values below are trimmed averages: the 10 lowest and 10 highest rounds are dropped and the remaining 80 are averaged.

Negative delta means faster.

| Set | Device (us) | Device vs baseline (us) | Sched (us) | Sched vs baseline (us) | Orch (us) |
| --- | --- | --- | --- | --- | --- |
| Baseline (`simpler-base`) | 2436.5 | +0.0 | 2105.2 | +0.0 | 383.3 |
| S0/S1/S2 DummyDrain avg | 2416.9 | -19.6 | 2091.5 | -13.7 | - |
| O-side prewire avg | 2400.4 | -36.1 | 2075.2 | -30.0 | 422.1 |
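For reference, a trimmed average like the one used for the table above can be computed as follows. This is a sketch under the assumption that trimming means sorting the per-round times and dropping a fixed count from each end; the PR does not show its actual script, and `trimmed_mean` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Mean of the samples after dropping the `drop` lowest and `drop` highest.
// Takes the vector by value so the caller's data is not reordered.
inline double trimmed_mean(std::vector<double> samples, std::size_t drop) {
    if (samples.size() <= 2 * drop) {
        throw std::invalid_argument("not enough samples to trim");
    }
    std::sort(samples.begin(), samples.end());
    auto first = samples.begin() + static_cast<std::ptrdiff_t>(drop);
    auto last = samples.end() - static_cast<std::ptrdiff_t>(drop);
    double sum = std::accumulate(first, last, 0.0);
    return sum / static_cast<double>(last - first);
}
```

With 100 rounds per run, `trimmed_mean(rounds, 10)` averages the middle 80 rounds, which keeps a few outlier rounds from dominating the reported value.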

Readout:

Device improves by 36.1 us vs baseline, about 1.48%.
Scheduler completion improves by 30.0 us vs baseline, about 1.43%.
Compared with the S0/S1/S2 dummy-drain-only version, O-side prewire adds another 16.5 us Device improvement and 16.3 us Sched improvement.
Orch time increases by about 38.8 us vs baseline (about 10.1%), which is the expected shift of work from S to O. For qwen, Orch still has enough slack that total Device and Sched time improve.
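The readout figures follow directly from the table. A short check of the arithmetic, using only the numbers reported above (the constant and function names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Values copied from the strategy-sweep table, in microseconds.
struct Row {
    double device_us;
    double sched_us;
};

constexpr Row kBaseline{2436.5, 2105.2};
constexpr Row kDummyDrain{2416.9, 2091.5};
constexpr Row kPrewire{2400.4, 2075.2};

// Orch time is only reported for the baseline and the O-side prewire rows.
constexpr double kBaselineOrchUs = 383.3;
constexpr double kPrewireOrchUs = 422.1;

// Improvement of `after` over `before`, as a percentage of `before`.
inline double improvement_pct(double before, double after) {
    return (before - after) / before * 100.0;
}
```

Here `improvement_pct(kBaseline.device_us, kPrewire.device_us)` evaluates to roughly 1.48 and the Sched equivalent to roughly 1.43, matching the percentages quoted in the readout.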

Latest pypto-lib qwen14b Decode 40L Check

Command:

task-submit --device 0 --max-time 0 --timeout 0 --run 'cd /data/pyptouser/zhangtao/zt/pypto-lb && .venv-bench/bin/python models/qwen3/14b/decode_layer.py -p a2a3 -d $TASK_DEVICE --validate-fwd --fwd-layers 40'

Result from after_lock_cleanup_qwen14b_40l_20260703_162637_dev0.log:

| Metric | Time |
| --- | --- |
| `device_wall` | 35.527200 ms |
| `orch` | 31.077500 ms |
| `sched` | 35.157720 ms |
| `runner_run` | 37.523351 ms |

Correctness:

argmax match 16/16
sample match 16/16
logits 100.0000% within 5e-2
max_abs_err=0.0199

Compared with the earlier device-0 baseline average for this 40L check, device_wall and sched are within about 0.5%, so the cleanup does not show an obvious performance regression.

Tests

git diff --check
cmake --build tests/ut/cpp/build -j$(nproc)
ctest --test-dir tests/ut/cpp/build --output-on-failure -R '^(test_wiring|test_a2a3_orchestrator_fanin|test_a5_wiring|test_a5_orchestrator_fanin)$'
pypto-lib qwen14b decode 40L validation:
- models/qwen3/14b/decode_layer.py -p a2a3 -d $TASK_DEVICE --validate-fwd --fwd-layers 40

coderabbitai · 2026-07-03T01:55:42Z

Walkthrough

This PR reworks the orchestrator/scheduler dependency-wiring pipeline: adding an explicit ready_state/mark_completed() slot-state model, inline fast-path wiring for zero-fanin/completed-producer tasks, a dep-pool spinlock guarding wiring, a new pending-ready wake-list discovery mechanism, and multi-thread dummy-queue draining.

Changes

Inline Wiring, Ready-State, and Wake-List Discovery

| Layer / File(s) | Summary |
| --- | --- |
| Slot state: ready_state and mark_completed `pto_runtime2_types.h` | Adds `PTO2ReadyState`/`PTO2CompletionFlag` enums, a `ready_state` atomic, `mark_completed()`/`is_completion_flag_set()` helpers, and reuse reset for `ready_state`. |
| Orchestrator inline-wiring fast paths `pto_orchestrator.cpp` | Caches producer state for "gone" checks, adds helpers for completed-fanin detection, inline-ready routing, and prewiring under dep-pool lock; `submit_task_common` now wires inline before falling back to the queue; alloc/reuse paths use `mark_completed()`/`dep_pool_mark` reset. |
| Scheduler dep-pool locking and wiring queue backoff `scheduler/pto_scheduler.h` | Adds `dep_pool_lock` and lock/unlock helpers, wraps `drain_wiring_queue` dep-pool logic in the lock, and introduces a two-phase startup/steady backoff triggering `poll_pending_ready()`. |
| Wake-list pending-ready discovery mechanism `scheduler/pto_scheduler.h` | Adds `PTO2WakeWaiter`, layout wake-region fields, `PendingReadyEntry`/`PendingReadyState`/`WakeListState`, and helpers for claiming, scanning, registering, and polling pending-ready tasks. |
| Unified ready routing and completion wiring `scheduler/pto_scheduler.h` | Introduces `route_ready_once()` used by `wire_task`, `release_fanin_and_check_ready`, and `drain_wake_list`; `on_task_complete` now uses `mark_completed()` and conditionally drains wake lists when discovery is enabled. |
| Arena layout and init wiring for wake regions and dep_pool_lock `shared/pto_runtime2_init.cpp` | Extends `reserve_layout` with `task_window_sizes` and wake-region sizing, initializes `dep_pool_lock`, wires/resets `wake_lists`/`pending_ready` across init/reset/wire paths, and updates `runtime_reserve_layout` call site. |
| Dummy ready queue multi-thread draining `scheduler_dispatch.cpp` | Switches dummy queue draining from thread 0 only to threads 0–2, and lowers the drain batch size from 16 to 8. |

Estimated code review effort: 5 (Critical) | ~110 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant submit_task_common
  participant all_claimed_fanin_completed
  participant try_orch_prewire_task
  participant route_ready_once
  participant WiringQueue

  Caller->>submit_task_common: submit task
  submit_task_common->>all_claimed_fanin_completed: check claimed producers
  alt zero fanin or all completed
    submit_task_common->>route_ready_once: route inline-ready
  else fanin pending
    submit_task_common->>try_orch_prewire_task: attempt prewire (dep_pool_lock)
    alt prewire succeeds
      try_orch_prewire_task-->>submit_task_common: wired
    else prewire fails
      submit_task_common->>WiringQueue: push task (spin if full)
    end
  end

sequenceDiagram
  participant Task
  participant on_task_complete
  participant WakeListState
  participant route_ready_once
  participant pending_ready

  Task->>on_task_complete: task finishes
  on_task_complete->>on_task_complete: slot_state.mark_completed()
  alt discovery enabled
    on_task_complete->>WakeListState: drain_wake_list
    WakeListState->>route_ready_once: route woken consumer
    route_ready_once->>pending_ready: requeue if fanin missing
  end
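The "ready-once" routing named in the walkthrough can be illustrated with a compare-and-swap on a per-slot state, so that several callers (wiring, fanin release, wake-list drain) can race and only one of them enqueues the task. This is a sketch under assumptions: `ReadyState`, `Slot`, and `try_route_ready_once` are hypothetical names, and the PR's `PTO2ReadyState` enum and `route_ready_once()` may be structured differently.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical two-state flag; the PR's PTO2ReadyState enum may differ.
enum class ReadyState : std::uint8_t { Pending = 0, Routed = 1 };

struct Slot {
    std::atomic<ReadyState> ready_state{ReadyState::Pending};
};

// Returns true only for the single caller that wins the transition
// Pending -> Routed; that caller is the one allowed to enqueue the task.
inline bool try_route_ready_once(Slot& slot) {
    ReadyState expected = ReadyState::Pending;
    return slot.ready_state.compare_exchange_strong(
        expected, ReadyState::Routed,
        std::memory_order_acq_rel, std::memory_order_acquire);
}

// On slot reuse the state goes back to Pending (the walkthrough mentions a
// "reuse reset" for ready_state; this particular helper is an assumption).
inline void reset_for_reuse(Slot& slot) {
    slot.ready_state.store(ReadyState::Pending, std::memory_order_release);
}
```

Every later caller sees `Routed` and returns false, so a task cannot land in a ready queue twice even when completion and wiring overlap.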

Possibly related PRs

hw-native-sys/simpler#1066: Both PRs modify fan-in/dependency handling in submit_task_common, one for producer dedup and the other for completion/"gone" status evaluation for inline wiring.
hw-native-sys/simpler#1141: Both PRs modify drain_wiring_queue dep-pool reclaim logic and the wiring-queue deadlock reporting section.
hw-native-sys/simpler#1245: Both PRs touch scheduler readiness/completion routing paths in pto_scheduler.h, including release_fanin_and_check_ready/on_task_complete.

gemini-code-assist

Code Review

This pull request optimizes the task scheduling and wiring pipeline by allowing the orchestrator to inline already-ready tasks and prewire tasks directly, bypassing the scheduler's wiring queue. To support this, thread-safe locking is added to the dependency pool. Additionally, a disabled-by-default "discovery" wake-up mechanism is introduced to optimize task wake-ups, and dummy ready queue draining is parallelized across multiple scheduler threads. There are no review comments, so I have no feedback to provide.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`:
- Around line 906-925: Preserve the dep_pool_mark for inline-ready tasks in PTO
orchestrator flow: in the inline-ready branches around route_orch_inline_ready
and try_orch_prewire_task, the slot is being made ready without going through
wire_task(), so the oldest live slot can end up with a zero mark and block
PTO2DepListPool::reclaim(). Update the inline path to assign the same
dep_pool_mark that wire_task() would have established before marking the task
ready, using the existing fanin_builder/current slot state helpers in
pto_orchestrator.cpp so dep-pool tail advancement still works when an inline
task becomes the head of the live window.
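To make the concern in this comment concrete, here is a toy model of why a missing mark can stall reclamation. It assumes that pool positions are monotonically increasing counters and that reclaim advances the tail to the mark of the oldest live slot. `LiveSlot` and `DepPoolModel` are hypothetical and do not reproduce `PTO2DepListPool`.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical model: each live slot remembers the pool position recorded
// when it was made ready. 0 means "never recorded".
struct LiveSlot {
    std::uint64_t dep_pool_mark = 0;
};

struct DepPoolModel {
    std::uint64_t head = 0;  // next entry to allocate
    std::uint64_t tail = 0;  // oldest entry still considered in use

    // Entries below the oldest live slot's mark can no longer be referenced,
    // so the tail may advance up to that mark (never backwards).
    void reclaim(const std::deque<LiveSlot>& live_window) {
        std::uint64_t limit =
            live_window.empty() ? head : live_window.front().dep_pool_mark;
        if (limit > tail) {
            tail = limit;
        }
    }
};
```

In this model, an inline-ready slot that reaches the head of the live window with a zero mark pins the tail at its old value. Recording the mark on the inline path (as the wfanin == 0 case in the PR description does with the dep-pool position) lets the tail move again.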

📥 Commits

Reviewing files that changed from the base of the PR and between 153daee and 15fef19.

📒 Files selected for processing (5)

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp

Move the live-fanin dep_pool wait out of PTO2DepListPool::ensure_space (a fast 100k-spin loop that goes fatal in under 1 ms) into the orchestrator's orch_wire_live_fanin_task. The new wait mirrors the fanin spill pool's ensure_space backstop: an absolute time budget plus a check for a fatal already latched elsewhere.

With wiring now on the orchestrator, a workload that stalls the scheduler (e.g. the scheduler_timeout fatal-code test, which lowers PTO2_SCHEDULER_TIMEOUT_MS) used to make O raise the fatal PTO2_ERROR_DEP_POOL_OVERFLOW (-4) at submit time within 1 ms, masking the real scheduler-timeout code (-100). Now O bails without latching when sched_error_code or orch_error_code is already set, so the root-cause code surfaces.

The residual dep_pool-deadlock backstop is sized to run strictly longer than the scheduler timeout (scheduler timeout + alloc-deadlock slack), so the scheduler's own fault is always observed first, rather than racing a fixed 500 ms budget that ties with a 500 ms scheduler timeout under load.

Applied to a2a3 and a5.

Test: tests/st/runtime_fatal_codes (sim, a2a3+a5): scheduler_timeout surfaces -100, dep_pool_overflow still surfaces -4, all cases pass.

Co-Authored-By: Claude <noreply@anthropic.com>
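The backstop described in this commit message can be sketched as a wait loop with an absolute deadline and an early exit when a fatal code is already latched. This is a minimal illustration under assumptions: `wait_for_dep_pool_space`, `WaitResult`, and the `has_space` callback are hypothetical, and only the error codes (-4, -100) and the "scheduler timeout + slack" sizing come from the message itself.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>

// Error codes quoted in the commit message.
constexpr std::int32_t kDepPoolOverflow = -4;
constexpr std::int32_t kSchedulerTimeout = -100;

enum class WaitResult { GotSpace, BailedOnExistingFatal, LatchedOverflow };

// Wait until `has_space()` is true, within an absolute time budget.
// If a fatal code is already latched elsewhere, return without overwriting
// it so the root-cause code stays visible.
inline WaitResult wait_for_dep_pool_space(
    const std::function<bool()>& has_space,
    std::atomic<std::int32_t>& sched_error_code,
    std::atomic<std::int32_t>& orch_error_code,
    std::chrono::milliseconds scheduler_timeout,
    std::chrono::milliseconds slack) {
    using Clock = std::chrono::steady_clock;
    // With slack > 0 the budget is strictly longer than the scheduler timeout.
    const auto deadline = Clock::now() + scheduler_timeout + slack;
    for (;;) {
        if (has_space()) {
            return WaitResult::GotSpace;
        }
        if (sched_error_code.load(std::memory_order_acquire) != 0 ||
            orch_error_code.load(std::memory_order_acquire) != 0) {
            return WaitResult::BailedOnExistingFatal;
        }
        if (Clock::now() >= deadline) {
            // Latch only if nothing else latched in the meantime.
            std::int32_t expected = 0;
            orch_error_code.compare_exchange_strong(expected, kDepPoolOverflow);
            return WaitResult::LatchedOverflow;
        }
    }
}
```

Because the deadline is derived from the scheduler timeout rather than a fixed constant, a stalled scheduler gets the chance to latch -100 first, and the orchestrator then exits through the `BailedOnExistingFatal` path instead of reporting -4.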

liuchangwen and others added 2 commits July 3, 2026 09:17

Optimize scheduler wiring with O-side prewire

2310d45

Move selected wiring work to the orchestrator side while preserving scheduler fallback, ready-once routing, and concurrent dummy drain. Co-authored-by: TaoZQY <zhangtaolqy@mail.ustc.edu.cn>

Merge upstream main into 20260702_PP

15fef19

Keep the O-side prewire and ready-once scheduler path while adopting upstream local-buffer removal. Co-authored-by: TaoZQY <zhangtaolqy@mail.ustc.edu.cn>

gemini-code-assist Bot reviewed Jul 3, 2026

coderabbitai Bot reviewed Jul 3, 2026

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated

ChaoWao mentioned this pull request Jul 3, 2026

Refactor: move fanout wiring to orchestrator, drop wiring queue #1264

Open

4 tasks

Crane-Liu force-pushed the 20260702_PP branch from 9a67b03 to 3b90ab6 Compare July 3, 2026 08:42

TaoZQY and others added 2 commits July 3, 2026 16:42

refactor: simplify orchestrator dependency wiring

3b90ab6

Co-authored-by: Crane-Liu <c.wliu@outlook.com>

ChaoWao force-pushed the 20260702_PP branch from 6105e80 to ea65ecd Compare July 5, 2026 02:27

test: drop unreachable scheduler-timeout fixture

5cb2e65

Crane-Liu wants to merge 5 commits into hw-native-sys:main from Crane-Liu:20260702_PP

4 participants
