optimize: prewire task dependencies on orchestrator side #1263

Open
Crane-Liu wants to merge 5 commits into hw-native-sys:main from Crane-Liu:20260702_PP

@Crane-Liu Crane-Liu commented Jul 3, 2026

Summary

Move dependency wiring to the orchestrator side for the a2a3 and a5 tensormap_and_ringbuffer runtimes, and remove the old scheduler-side deferred wiring fallback.

The latest cleanup makes this PR a single orchestrator-side (Orch-side) wiring design instead of a mixed orchestrator/scheduler (O/S) fallback path:

  • Zero-fanin tasks are published ready directly by Orch.
  • Fanin tasks whose producers are already complete are published ready directly by Orch.
  • Fanin tasks with live producers are wired by Orch into the dependency pool before scheduling.
  • Scheduler threads no longer drain a deferred wiring queue.
  • S0/S1/S2 can still complete dependency-only/dummy fanout in parallel.
  • a2a3 and a5 now follow the same converged structure.

Modified Code

a2a3 runtime

  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • Move no-fanin, completed-fanin, and live-fanin wiring/publish decisions into Orch submit.
    • Remove fallback branches that sent dependency wiring back to the scheduler path.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
    • Remove old SPSC wiring queue helpers and deferred wiring drain logic.
    • Keep the shared fanin wiring helper used by Orch-side live-fanin wiring.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
    • Remove the scheduler dispatch phase that drained deferred wiring work.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
    • Align cold-path completion/fanout handling with the new Orch-wired dependency state.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
    • Remove wiring queue layout/init/reset/destroy plumbing.
    • Update runtime initialization around Orch-owned dependency wiring state.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.{cpp,h}
    • Remove stale wiring queue APIs and state exposure.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
    • Remove deprecated wiring state fields tied to the fallback queue.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
    • Update ring/dependency-pool comments and ownership assumptions.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/{scheduler_context.h,scheduler_types.h}
    • Remove unused scheduler-side wiring/drain fields.

a5 runtime

  • Applied the same convergence to the matching a5 files under src/a5/runtime/tensormap_and_ringbuffer/runtime/.
  • a5 now uses the same Orch-side submit/wiring model as a2a3 instead of carrying an independent fallback path.

Shared profiling / tooling

  • src/a2a3/platform/include/common/l2_swimlane_profiling.h
  • src/a5/platform/include/common/l2_swimlane_profiling.h
  • simpler_setup/tools/swimlane_converter.py

These were updated so the old scheduler-wire swimlane/profiling label is treated as legacy scheduler wire phase metadata rather than an active deferred wiring phase.

Documentation Updates

Updated both a2a3 and a5 runtime docs:

  • src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
  • src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
  • src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
  • src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
  • src/a5/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
  • src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md

The docs now describe:

  • Orch-side dependency wiring as the primary path.
  • Removal of the scheduler-side deferred wiring queue fallback.
  • Updated ownership of dependency-pool writes.
  • The remaining scheduler responsibility: execute ready work and release fanout.
  • The fact that S0/S1/S2 can complete dependency-only/dummy fanout in parallel.

Test Updates

  • tests/ut/cpp/a2a3/test_wiring.cpp
  • tests/ut/cpp/a5/test_wiring.cpp

Updated wiring tests to exercise Orch-side helpers directly.

Removed obsolete tests for the deleted SPSC wiring queue:

  • tests/ut/cpp/a2a3/test_spsc_queue.cpp
  • tests/ut/cpp/a5/test_spsc_queue.cpp

Updated:

  • tests/ut/cpp/CMakeLists.txt

Current Behavior

  1. wfanin == 0: Orch seeds fanin state, records the dep-pool position, and routes the task ready.
  2. wfanin > 0 and all producers are already complete: Orch seeds fanin/dispatch state and routes the task ready.
  3. wfanin > 0 with unfinished producers: Orch wires fanout entries through the shared wiring helper before the task becomes visible to schedulers.
  4. Scheduler threads complete ready tasks and release fanout; they do not perform deferred wiring drain anymore.
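
The first three cases above can be sketched as a small classification step. This is a simplified model, not the code in pto_orchestrator.cpp: `ProducerState`, `SubmitRoute`, `SubmitDecision`, and `classify_submit` are hypothetical names introduced only for illustration, and the real submit path also seeds fanin/dispatch state and records the dep-pool position.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of the Orch-side submit decision.
// None of these names are taken from the PR's sources.
enum class ProducerState { Live, Completed };

enum class SubmitRoute {
    ReadyNoFanin,         // case 1: wfanin == 0
    ReadyCompletedFanin,  // case 2: every producer has already finished
    WiredLiveFanin        // case 3: at least one producer is still running
};

struct SubmitDecision {
    SubmitRoute route;
    std::uint32_t live_producers;  // fanout entries Orch would have to wire
};

// Classify a task from the states of its producers (its fanin set).
inline SubmitDecision classify_submit(const std::vector<ProducerState>& fanin) {
    if (fanin.empty()) {
        return {SubmitRoute::ReadyNoFanin, 0};
    }
    std::uint32_t live = 0;
    for (ProducerState s : fanin) {
        if (s == ProducerState::Live) ++live;
    }
    if (live == 0) {
        return {SubmitRoute::ReadyCompletedFanin, 0};
    }
    return {SubmitRoute::WiredLiveFanin, live};
}
```

In this model only the third route touches the dependency pool, which is consistent with the description that schedulers never see a task before its fanout entries are wired.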

Results

Previous 100-round Strategy Sweep

Benchmark: qwen3_14b_decode, StressBatch16Seq3500, a2a3, device 4, tensormap_and_ringbuffer, 100 rounds per run. Values below are trimmed averages, dropping the 10 lowest and 10 highest rounds.

Negative delta means faster.
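
For readers who want to reproduce the aggregation, a trimmed average of this kind might be computed as follows. This is a sketch that assumes the rounds are trimmed by sorted value; `trimmed_mean` is a hypothetical helper, not part of the benchmark tooling.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Trimmed mean: sort, drop `trim` samples from each end, average the rest.
// With 100 rounds and trim = 10, the middle 80 rounds contribute.
inline double trimmed_mean(std::vector<double> samples, std::size_t trim) {
    if (samples.size() <= 2 * trim) {
        throw std::invalid_argument("not enough samples to trim");
    }
    std::sort(samples.begin(), samples.end());
    auto first = samples.begin() + static_cast<std::ptrdiff_t>(trim);
    auto last = samples.end() - static_cast<std::ptrdiff_t>(trim);
    const double sum = std::accumulate(first, last, 0.0);
    return sum / static_cast<double>(last - first);
}
```

Calling `trimmed_mean(rounds, 10)` on the 100 per-round timings of one run would give one cell of the table under that assumption.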

| Set | Device (us) | Device vs Baseline (us) | Sched (us) | Sched vs Baseline (us) | Orch (us) |
| --- | --- | --- | --- | --- | --- |
| Baseline (simpler-base) | 2436.5 | +0.0 | 2105.2 | +0.0 | 383.3 |
| S0/S1/S2 DummyDrain avg | 2416.9 | -19.6 | 2091.5 | -13.7 | - |
| O-side prewire avg | 2400.4 | -36.1 | 2075.2 | -30.0 | 422.1 |

Readout:

  • Device improves by 36.1 us vs baseline, about 1.48%.
  • Scheduler completion improves by 30.0 us vs baseline, about 1.43%.
  • Compared with the S0/S1/S2 dummy-drain-only version, O-side prewire adds another 16.5 us Device improvement and 16.3 us Sched improvement.
  • Orch time increases by about 38.8 us vs baseline, which is the expected work shift from S to O. For qwen, O-side completion still has enough slack that the total Device/Sched time improves.
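
The readout figures follow directly from the trimmed averages reported for this sweep; the arithmetic, written out as a check:

```latex
\begin{align*}
\Delta_{\text{Device}} &= 2400.4 - 2436.5 = -36.1\ \mu\text{s}, &
  \frac{36.1}{2436.5} &\approx 1.48\% \\
\Delta_{\text{Sched}} &= 2075.2 - 2105.2 = -30.0\ \mu\text{s}, &
  \frac{30.0}{2105.2} &\approx 1.43\% \\
\Delta_{\text{Orch}} &= 422.1 - 383.3 = +38.8\ \mu\text{s} \\
\text{vs. DummyDrain:}\quad
  2416.9 - 2400.4 &= 16.5\ \mu\text{s (Device)}, &
  2091.5 - 2075.2 &= 16.3\ \mu\text{s (Sched)}
\end{align*}
```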

Latest pypto-lib qwen14b Decode 40L Check

Command:

task-submit --device 0 --max-time 0 --timeout 0 --run 'cd /data/pyptouser/zhangtao/zt/pypto-lb && .venv-bench/bin/python models/qwen3/14b/decode_layer.py -p a2a3 -d $TASK_DEVICE --validate-fwd --fwd-layers 40'

Result from after_lock_cleanup_qwen14b_40l_20260703_162637_dev0.log:

| Metric | Time |
| --- | --- |
| device_wall | 35.527200 ms |
| orch | 31.077500 ms |
| sched | 35.157720 ms |
| runner_run | 37.523351 ms |

Correctness:

  • argmax match 16/16
  • sample match 16/16
  • logits 100.0000% within 5e-2
  • max_abs_err=0.0199

Compared with the earlier device-0 baseline average for this 40L check, device_wall and sched are within about 0.5%, so the cleanup does not show an obvious performance regression.

Tests

  • git diff --check
  • cmake --build tests/ut/cpp/build -j$(nproc)
  • ctest --test-dir tests/ut/cpp/build --output-on-failure -R '^(test_wiring|test_a2a3_orchestrator_fanin|test_a5_wiring|test_a5_orchestrator_fanin)$'
  • pypto-lib qwen14b decode 40L validation:
    • models/qwen3/14b/decode_layer.py -p a2a3 -d $TASK_DEVICE --validate-fwd --fwd-layers 40

liuchangwen and others added 2 commits July 3, 2026 09:17
Move selected wiring work to the orchestrator side while preserving scheduler fallback, ready-once routing, and concurrent dummy drain.

Co-authored-by: TaoZQY <zhangtaolqy@mail.ustc.edu.cn>

Keep the O-side prewire and ready-once scheduler path while adopting upstream local-buffer removal.

Co-authored-by: TaoZQY <zhangtaolqy@mail.ustc.edu.cn>

coderabbitai Bot commented Jul 3, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b89a8461-a2aa-41c7-9572-1919ecc15616

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR reworks the orchestrator/scheduler dependency-wiring pipeline: adding an explicit ready_state/mark_completed() slot-state model, inline fast-path wiring for zero-fanin/completed-producer tasks, a dep-pool spinlock guarding wiring, a new pending-ready wake-list discovery mechanism, and multi-thread dummy-queue draining.

Changes

Inline Wiring, Ready-State, and Wake-List Discovery

| Layer | File(s) | Summary |
| --- | --- | --- |
| Slot state: ready_state and mark_completed | pto_runtime2_types.h | Adds PTO2ReadyState/PTO2CompletionFlag enums, a ready_state atomic, mark_completed()/is_completion_flag_set() helpers, and reuse reset for ready_state. |
| Orchestrator inline-wiring fast paths | pto_orchestrator.cpp | Caches producer state for "gone" checks, adds helpers for completed-fanin detection, inline-ready routing, and prewiring under dep-pool lock; submit_task_common now wires inline before falling back to the queue; alloc/reuse paths use mark_completed()/dep_pool_mark reset. |
| Scheduler dep-pool locking and wiring queue backoff | scheduler/pto_scheduler.h | Adds dep_pool_lock and lock/unlock helpers, wraps drain_wiring_queue dep-pool logic in the lock, and introduces a two-phase startup/steady backoff triggering poll_pending_ready(). |
| Wake-list pending-ready discovery mechanism | scheduler/pto_scheduler.h | Adds PTO2WakeWaiter, layout wake-region fields, PendingReadyEntry/PendingReadyState/WakeListState, and helpers for claiming, scanning, registering, and polling pending-ready tasks. |
| Unified ready routing and completion wiring | scheduler/pto_scheduler.h | Introduces route_ready_once() used by wire_task, release_fanin_and_check_ready, and drain_wake_list; on_task_complete now uses mark_completed() and conditionally drains wake lists when discovery is enabled. |
| Arena layout and init wiring for wake regions and dep_pool_lock | shared/pto_runtime2_init.cpp | Extends reserve_layout with task_window_sizes and wake-region sizing, initializes dep_pool_lock, wires/resets wake_lists/pending_ready across init/reset/wire paths, and updates runtime_reserve_layout call site. |
| Dummy ready queue multi-thread draining | scheduler_dispatch.cpp | Switches dummy queue draining from thread 0 only to threads 0–2, and lowers the drain batch size from 16 to 8. |
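
The "ready-once" routing named in the summary can be illustrated with a compare-and-swap on an atomic state field. This is a sketch under assumptions: the enum values, the `SlotState` layout, and the exact semantics of `route_ready_once` and `mark_completed` here are illustrative stand-ins, not the definitions in pto_runtime2_types.h or pto_scheduler.h.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical slot state; the enum values are illustrative and are not the
// PR's PTO2ReadyState definition.
enum class ReadyState : std::uint8_t { NotReady = 0, Routed = 1 };

struct SlotState {
    std::atomic<ReadyState> ready_state{ReadyState::NotReady};
    std::atomic<bool> completed{false};

    void mark_completed() { completed.store(true, std::memory_order_release); }
    bool is_completed() const { return completed.load(std::memory_order_acquire); }
};

// Returns true for exactly one caller per slot, even if the orchestrator and
// several scheduler threads race to route the same task.
inline bool route_ready_once(SlotState& slot) {
    ReadyState expected = ReadyState::NotReady;
    return slot.ready_state.compare_exchange_strong(
        expected, ReadyState::Routed,
        std::memory_order_acq_rel, std::memory_order_acquire);
}
```

Only the caller that wins the exchange would push the task to a ready queue; every other caller sees `false` and does nothing, so a task cannot be enqueued twice.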

Estimated code review effort: 5 (Critical) | ~110 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant submit_task_common
  participant all_claimed_fanin_completed
  participant try_orch_prewire_task
  participant route_ready_once
  participant WiringQueue

  Caller->>submit_task_common: submit task
  submit_task_common->>all_claimed_fanin_completed: check claimed producers
  alt zero fanin or all completed
    submit_task_common->>route_ready_once: route inline-ready
  else fanin pending
    submit_task_common->>try_orch_prewire_task: attempt prewire (dep_pool_lock)
    alt prewire succeeds
      try_orch_prewire_task-->>submit_task_common: wired
    else prewire fails
      submit_task_common->>WiringQueue: push task (spin if full)
    end
  end

sequenceDiagram
  participant Task
  participant on_task_complete
  participant WakeListState
  participant route_ready_once
  participant pending_ready

  Task->>on_task_complete: task finishes
  on_task_complete->>on_task_complete: slot_state.mark_completed()
  alt discovery enabled
    on_task_complete->>WakeListState: drain_wake_list
    WakeListState->>route_ready_once: route woken consumer
    route_ready_once->>pending_ready: requeue if fanin missing
  end

Possibly related PRs

  • hw-native-sys/simpler#1066: Both PRs modify fan-in/dependency handling in submit_task_common, one for producer dedup and the other for completion/"gone" status evaluation for inline wiring.
  • hw-native-sys/simpler#1141: Both PRs modify drain_wiring_queue dep-pool reclaim logic and the wiring-queue deadlock reporting section.
  • hw-native-sys/simpler#1245: Both PRs touch scheduler readiness/completion routing paths in pto_scheduler.h, including release_fanin_and_check_ready/on_task_complete.

Poem

A hop, a lock, a wake-list glow,
Inline paths let fast tasks go.
Ready-state bits now click in place,
Producers claimed at steady pace.
Three threads now drain the dummy queue —
This rabbit thumps approval too! 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly captures the main change: moving dependency prewiring to the orchestrator side. |
| Description check | ✅ Passed | The description is directly about the same wiring refactor and scheduler/orchestrator behavior changes. |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes the task scheduling and wiring pipeline by allowing the orchestrator to inline already-ready tasks and prewire tasks directly, bypassing the scheduler's wiring queue. To support this, thread-safe locking is added to the dependency pool. Additionally, a disabled-by-default "discovery" wake-up mechanism is introduced to optimize task wake-ups, and dummy ready queue draining is parallelized across multiple scheduler threads. There are no review comments, so I have no feedback to provide.
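
As an illustration of what guarding the dependency pool with a lock can look like when both the orchestrator and scheduler threads may write to it, here is a minimal sketch. `SpinLock` and `DepPool` are hypothetical types written for this example; the primitive actually used for dep_pool_lock in the runtime may differ.

```cpp
#include <atomic>
#include <mutex>

// Minimal exchange-based spinlock; a stand-in for whatever primitive the
// runtime actually uses for dep_pool_lock.
class SpinLock {
public:
    void lock() {
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // busy-wait; production code would likely add a pause/backoff hint
        }
    }
    bool try_lock() { return !locked_.exchange(true, std::memory_order_acquire); }
    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical dependency pool: a bump allocator over a fixed number of entries.
struct DepPool {
    static constexpr int kCapacity = 4;
    SpinLock lock;
    int top = 0;

    // Returns the allocated entry index, or -1 if the pool is full.
    int alloc_entry() {
        std::lock_guard<SpinLock> guard(lock);  // released on every return path
        if (top == kCapacity) return -1;
        return top++;
    }
};
```

Because `SpinLock` provides `lock()` and `unlock()`, it works with `std::lock_guard`, which keeps the critical section short and exception-safe.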

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`:
- Around line 906-925: Preserve the dep_pool_mark for inline-ready tasks in PTO
orchestrator flow: in the inline-ready branches around route_orch_inline_ready
and try_orch_prewire_task, the slot is being made ready without going through
wire_task(), so the oldest live slot can end up with a zero mark and block
PTO2DepListPool::reclaim(). Update the inline path to assign the same
dep_pool_mark that wire_task() would have established before marking the task
ready, using the existing fanin_builder/current slot state helpers in
pto_orchestrator.cpp so dep-pool tail advancement still works when an inline
task becomes the head of the live window.
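
The failure mode described in this comment can be reproduced in a toy model. The sketch below is not PTO2DepListPool; `Slot`, `DepPoolModel`, and `submit_inline_ready` are hypothetical, and the reclaim rule (advance the tail to the mark of the oldest live slot) is an assumption made to mirror the reviewer's description.

```cpp
#include <cstdint>
#include <deque>

// Toy model only: these types are hypothetical and much simpler than the
// runtime's dependency pool.
struct Slot {
    std::uint32_t dep_pool_mark = 0;  // pool position recorded at submit/wire time
};

struct DepPoolModel {
    std::uint32_t top = 0;   // next entry to allocate (monotonic)
    std::uint32_t tail = 0;  // oldest entry still treated as in use

    // Advance the tail to the mark of the oldest live slot. A mark that is
    // still zero can never move the tail forward.
    void reclaim(const std::deque<Slot>& live_window) {
        if (live_window.empty()) {
            tail = top;
            return;
        }
        const std::uint32_t mark = live_window.front().dep_pool_mark;
        if (mark > tail) tail = mark;
    }
};

// An inline-ready task allocates no entries, but it should still record the
// current pool position so it does not pin the tail once it is the oldest slot.
inline Slot submit_inline_ready(const DepPoolModel& pool, bool record_mark) {
    Slot s;
    if (record_mark) s.dep_pool_mark = pool.top;
    return s;
}
```

In this model, an inline-ready slot submitted with `record_mark == false` leaves the tail stuck at 0 even though every earlier entry is free, while recording `pool.top` lets `reclaim` release them.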

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 73457185-c6e8-4a0a-b250-b01562f2d41a

📥 Commits

Reviewing files that changed from the base of the PR and between 153daee and 15fef19.

📒 Files selected for processing (5)
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
TaoZQY and others added 2 commits July 3, 2026 16:42
Co-authored-by: Crane-Liu <c.wliu@outlook.com>
Move the live-fanin dep_pool wait out of PTO2DepListPool::ensure_space
(fast 100k-spin fatal in <1ms) into the orchestrator's
orch_wire_live_fanin_task, mirroring the fanin spill pool's ensure_space
backstop: an absolute time budget plus a check for a fatal already
latched elsewhere.

With wiring now on the orchestrator, a workload that stalls the scheduler
(e.g. the scheduler_timeout fatal-code test, which lowers
PTO2_SCHEDULER_TIMEOUT_MS) used to see O fatal
PTO2_ERROR_DEP_POOL_OVERFLOW (-4) at submit time within <1ms, masking the
real scheduler-timeout code (-100). Now O bails without latching when
sched_error_code / orch_error_code is already set, so the root-cause code
surfaces. The residual dep_pool-deadlock backstop is sized to run strictly
longer than the scheduler timeout (scheduler timeout + alloc-deadlock
slack) so the scheduler's own fault is always observed first, instead of
racing a fixed 500ms budget that ties with a 500ms scheduler timeout under
load.

Applied to a2a3 and a5.

Test: tests/st/runtime_fatal_codes (sim, a2a3+a5) — scheduler_timeout
surfaces -100, dep_pool_overflow still -4, all cases pass.

Co-Authored-By: Claude <noreply@anthropic.com>
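
The wait policy described in the commit message above (an absolute time budget plus a check for a fatal already latched elsewhere) might look roughly like this. It is a sketch under assumptions: `ErrorLatches`, `WaitResult`, `wait_for_dep_pool_space`, and `dep_pool_budget` are hypothetical names, and the real orch_wire_live_fanin_task is not shown in this conversation.

```cpp
#include <atomic>
#include <chrono>
#include <functional>

enum class WaitResult { Acquired, BailedOnExistingFatal, TimedOut };

// Hypothetical error latches; 0 means "no error recorded".
struct ErrorLatches {
    std::atomic<int> sched_error_code{0};
    std::atomic<int> orch_error_code{0};
};

// Spin until `has_space` returns true, an error is already latched elsewhere,
// or the absolute time budget expires.
inline WaitResult wait_for_dep_pool_space(const std::function<bool()>& has_space,
                                          const ErrorLatches& errors,
                                          std::chrono::milliseconds budget) {
    const auto deadline = std::chrono::steady_clock::now() + budget;
    for (;;) {
        if (has_space()) return WaitResult::Acquired;
        if (errors.sched_error_code.load(std::memory_order_acquire) != 0 ||
            errors.orch_error_code.load(std::memory_order_acquire) != 0) {
            // Bail without latching so the root-cause code stays visible.
            return WaitResult::BailedOnExistingFatal;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            // The caller would latch the dep-pool overflow code here.
            return WaitResult::TimedOut;
        }
    }
}

// Size the budget strictly longer than the scheduler timeout so the
// scheduler's own fault is observed first.
inline std::chrono::milliseconds dep_pool_budget(
    std::chrono::milliseconds scheduler_timeout, std::chrono::milliseconds slack) {
    return scheduler_timeout + slack;
}
```

With a budget derived from the scheduler timeout rather than a fixed value, a stalled scheduler latches its own code (-100 in the test above) before the orchestrator could report an overflow.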

4 participants