refactor(tmr): remove PTO2LocalReadyBuffer local-first dispatch by ChaoWao · Pull Request #1245 · hw-native-sys/simpler

ChaoWao · 2026-07-02T01:57:11Z

Summary

- Remove the PTO2LocalReadyBuffer local-first dispatch optimization from the
  tensormap_and_ringbuffer scheduler on both a2a3 and a5. Ready tasks now go
  straight to the shared MPMC ready_queues[] (DUMMY → dummy_ready_queue via
  the existing push_ready_routed()). Delete the struct, its constants, the
  local_bufs[] allocation, and the spill/flush machinery, and drop the param
  from all 10 affected signatures. a2a3 speculative-release / early-dispatch is
  untouched (orthogonal).
- Profiling cleanup (platform + Python): drop local_depth_* from
  L2SwimlaneAicpuSchedPhaseRecord, the local_at_* params of
  l2_swimlane_aicpu_record_sched_phase, the host JSON emit, and the
  swimlane_converter.py handling. Shared-queue depth tracking is retained.
- C++ UT (tests/ut/cpp/{a2a3,a5}): remove LocalReadyBufferTest, rewrite the
  get_ready_tasks_batch test to drain the shared queue.
- Fix a latent kernel bug the timing change unmasked in
  examples/workers/l3/ep_dispatch_combine: dispatch wrote recv_count_out with
  a raw scalar GM store and never flushed it to HBM, so under the new dispatch
  timing the downstream local_expert read it as 0 → zero rows → all-zero
  output. Add a single-cache-line dcci after the write. Full root cause in
  docs/investigations/2026-07-local-buffer-removal-ep-combine-regression.md.
- Restore the tmr benchmark set in tools/benchmark_rounds.sh: #1177
  ("feat(dfx): [STRACE] host/device timing markers; single source of truth")
  silently dropped the 6 cases that #1157 ("bench: add qwen3_14b_decode to
  tensormap_and_ringbuffer benchmark set") added. spmd_paged_attention is
  commented out pending a pre-existing onboard stall (reproduces on baseline,
  unrelated).
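
The shape-based routing in the first item can be pictured with a small sketch. This is not the repository's code: only the names `ready_queues`, `dummy_ready_queue` and `push_ready_routed` come from this PR. The `Shape` enum, the `Task` struct, the `SchedulerState` wrapper and the mutex-guarded `ReadyQueue` (standing in for the real MPMC queue) are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical task shapes; the real enum is not shown in this PR.
enum class Shape : std::size_t { AIC = 0, AIV = 1, MIX = 2, DUMMY = 3 };
constexpr std::size_t kNumDispatchShapes = 3;  // AIC, AIV, MIX

struct Task {
    int id;
    Shape shape;
};

// Mutex-guarded stand-in for the shared MPMC ready queue (assumption).
class ReadyQueue {
public:
    void push(const Task& t) {
        std::lock_guard<std::mutex> lock(mu_);
        q_.push_back(t);
    }
    std::optional<Task> pop() {
        std::lock_guard<std::mutex> lock(mu_);
        if (q_.empty()) return std::nullopt;
        Task t = q_.front();
        q_.pop_front();
        return t;
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mu_);
        return q_.size();
    }

private:
    mutable std::mutex mu_;
    std::deque<Task> q_;
};

// Hypothetical container for the queues every scheduler thread shares.
struct SchedulerState {
    std::array<ReadyQueue, kNumDispatchShapes> ready_queues;
    ReadyQueue dummy_ready_queue;
};

// Simplified routing: DUMMY tasks go to their own queue, everything else
// goes to the per-shape shared queue. No per-thread buffer is involved.
void push_ready_routed(SchedulerState& s, const Task& t) {
    if (t.shape == Shape::DUMMY) {
        s.dummy_ready_queue.push(t);
    } else {
        s.ready_queues[static_cast<std::size_t>(t.shape)].push(t);
    }
}
```

In this sketch a task becomes poppable by any thread as soon as `push_ready_routed` returns, which is the property the PR relies on.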

Performance (Effective = orch∪sched window, 100 rounds, same locked a2a3 device)

Example	Base (µs)	HEAD (µs)	Change
alternating_matmul_add	786.0	737.4	−6.2%
benchmark_bgemm	789.3	673.2	−14.7%
paged_attention_unroll C1	1207.9	1149.2	−4.9%
paged_attention_unroll C2	619.0	573.8	−7.3%
paged_attention_unroll_manual_scope C1	1196.6	1146.6	−4.2%
paged_attention_unroll_manual_scope C2	617.4	566.9	−8.2%
batch_paged_attention	3807.7	3107.1	−18.4%
qwen3_14b_decode (StressBatch16Seq3500)	2210.0	2112.1	−4.4%

Removing the local buffer makes newly-ready tasks immediately visible to all
scheduler threads, improving load balance.
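
A minimal single-threaded sketch of the visibility difference, under stated assumptions: `LocalBuffer`, its capacity, and both `publish_*` helpers are hypothetical, and a plain `std::deque<int>` stands in for the shared MPMC queue. Only the `nullptr` / `!try_push` fallback pattern is taken from the commit message.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical fixed-capacity per-thread buffer. Tasks stored here are
// invisible to peer scheduler threads until they are flushed or spilled.
struct LocalBuffer {
    std::size_t capacity;
    std::vector<int> tasks;

    bool try_push(int task_id) {
        if (tasks.size() >= capacity) return false;
        tasks.push_back(task_id);
        return true;
    }
};

// Old behaviour (sketch): prefer the producing thread's local buffer and
// fall back to the shared queue when there is no buffer or it is full.
void publish_local_first(LocalBuffer* local, std::deque<int>& shared, int task_id) {
    if (local == nullptr || !local->try_push(task_id)) {
        shared.push_back(task_id);
    }
}

// New behaviour (sketch): every ready task lands in the shared queue, so
// any scheduler thread can pick it up immediately.
void publish_shared(std::deque<int>& shared, int task_id) {
    shared.push_back(task_id);
}
```

With a capacity of 2, three tasks published through `publish_local_first` leave two parked in the local buffer and only one in the shared queue; the same three tasks published through `publish_shared` are all available to peers.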

Testing

Simulation: a2a3sim + a5sim tmr suites pass; ep_dispatch_combine sim passes
Hardware (a2a3 onboard): tmr suite 33 passed / 1 skipped; ep_dispatch_combine 3/3 pass after the dcci fix (was 3/3 fail); l2_swimlane dfx (incl. swimlane_converter smoke over the new JSON) passes
C++ UT sources compile clean after the test updates
Benchmark on real a2a3 silicon (table above)

Note: pre-existing failure (not caused by this PR)

spmd_paged_attention fails onboard with 507018 S1:running-stalled (a
forward-progress stall, zero deadlock/capacity detectors). Verified it reproduces
identically on the merge-base baseline (local buffer present), so it is
pre-existing and unrelated. Commented out in the benchmark set.

🤖 Generated with Claude Code

coderabbitai · 2026-07-02T01:57:27Z


Walkthrough

This PR removes the scheduler's per-thread local ready-buffer optimization (PTO2LocalReadyBuffer) across both a2a3 and a5 runtime variants, routing all ready-task dispatch through shared per-shape ready queues instead. Corresponding swimlane profiling structs, APIs, JSON export, and the converter tool are updated to remove local queue-depth snapshot fields/parameters, retaining only shared-queue depth data. Documentation is updated to match. A benchmark configuration script is separately extended with new example case lists.

Changes

Local Ready-Buffer Removal and Shared-Queue Profiling

Layer / File(s)	Summary
Swimlane profiling schema and API `src/a2a3/platform/include/common/l2_swimlane_profiling.h`, `src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h`, `src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp`, `src/a2a3/platform/shared/host/l2_swimlane_collector.cpp`, `src/a5/platform/include/common/l2_swimlane_profiling.h`, `src/a5/platform/include/aicpu/l2_swimlane_collector_aicpu.h`, `src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp`, `src/a5/platform/shared/host/l2_swimlane_collector.cpp`, `simpler_setup/tools/swimlane_converter.py`, `docs/scheduler.md`	`L2SwimlaneAicpuSchedPhaseRecord` drops `local_depth_at_start/end` arrays (keeping shared arrays and adjusting padding), `l2_swimlane_aicpu_record_sched_phase` drops `local_at_start/end` parameters, JSON export/converter/docs are updated to only reference shared queue depth.
Core scheduler ready-buffer removal `src/a2a3/runtime/.../pto_scheduler.h`, `src/a2a3/runtime/.../pto_async_wait.h`, `src/a5/runtime/.../pto_scheduler.h`, `src/a5/runtime/.../pto_async_wait.h`	`PTO2LocalReadyBuffer` type/constant and forward declarations are removed; `release_fanin_and_check_ready`, `get_ready_tasks_batch`, `on_task_complete`, and `AsyncWaitList::poll_and_complete` no longer take local buffer parameters and route directly via `push_ready_routed`/global `ready_queues`.
SchedulerContext wiring `src/a2a3/runtime/.../scheduler_context.h`, `src/a2a3/runtime/.../scheduler_completion.cpp`, `src/a5/runtime/.../scheduler_context.h`, `src/a5/runtime/.../scheduler_completion.cpp`	`pop_ready_tasks_batch`, `dispatch_shape`, `dispatch_ready_tasks`, `complete_slot_task`, and `check_running_cores_for_completion` drop `local_bufs` parameters; `has_residual_mix` now checks only the global MIX ready queue.
Dispatch loop and shared-only phase profiling `src/a2a3/runtime/.../scheduler_dispatch.cpp`, `src/a5/runtime/.../scheduler_dispatch.cpp`	Local buffer allocation, spill/flush, and residual tracking are removed from `resolve_and_dispatch`; MIX/AIC/AIV gating uses shared queues; swimlane phase records (Complete, Wire, Dummy, Dispatch) capture and emit only shared queue-depth snapshots via `capture_phase_end`.

Estimated code review effort: 4 (Complex) | ~60 minutes

Benchmark Configuration Expansion

Layer / File(s)	Summary
Benchmark case list expansion `tools/benchmark_rounds.sh`	Adds new `TMR_EXAMPLE_CASES` entries for `benchmark_bgemm`, `paged_attention_unroll` variants, `batch_paged_attention`, updates `qwen3_14b_decode`, and extends `TMR_EXAMPLE_ORDER` accordingly.

Estimated code review effort: 1 (Trivial) | ~5 minutes

Sequence Diagram(s)

sequenceDiagram
  participant resolve_and_dispatch
  participant SchedulerContext
  participant ready_queues
  participant CoreTracker
  resolve_and_dispatch->>SchedulerContext: check_running_cores_for_completion(no local_bufs)
  SchedulerContext->>ready_queues: on_task_complete triggers push_ready_routed
  resolve_and_dispatch->>SchedulerContext: dispatch_ready_tasks(no local_bufs)
  SchedulerContext->>SchedulerContext: dispatch_shape(shape, tracker)
  SchedulerContext->>ready_queues: pop_ready_tasks_batch(shape, out, max_count)
  ready_queues-->>SchedulerContext: batch of ready tasks
  SchedulerContext->>CoreTracker: dispatch tasks to cores

Possibly related PRs

hw-native-sys/simpler#942: Both modify the AICPU scheduler-phase swimlane profiling schema/API (l2_swimlane_aicpu_record_sched_phase, L2SwimlaneAicpuSchedPhaseRecord).
hw-native-sys/simpler#1000: Directly related since this PR removes the local depth snapshot plumbing (local_at_*/local_depth_at_*) that #1000 had originally added.
hw-native-sys/simpler#989: Both touch the core scheduler dispatch/completion code paths in scheduler_dispatch.cpp and scheduler_completion.cpp.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: removing PTO2LocalReadyBuffer local-first dispatch in tmr.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The description matches the changeset, covering local ready-buffer removal, profiling cleanup, benchmark restoration, and related fixes.


gemini-code-assist

Code Review

This pull request removes the thread-local ready buffer optimization (PTO2LocalReadyBuffer) and its associated profiling and swimlane tracking from the scheduler implementation across both the a2a3 and a5 platforms. The task dispatch and completion paths have been simplified to route tasks directly to the global per-shape ready queues. Additionally, several new benchmark cases have been added to the benchmark script. As there are no review comments provided, I have no feedback to offer on the review itself.


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 738-752: The shared queue-depth snapshot logic is stale for phases
that enqueue work: in the dispatch/emit path, phases like Complete, Wire, and
Dummy currently reuse phase_start_shared for both start and end, so
shared_depth_at_end and the next phase’s start value can lag behind the actual
queue state. Update the emit flow to take a fresh end snapshot after any phase
that mutates shared queues, then assign that result back into phase_start_shared
before the next phase runs. Apply the same fix consistently anywhere this
pattern is used so the cached shared-depth sampling stays correct.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 653-662: The shared-depth snapshots used by queue-mutating phase
records are stale because Complete, Wire, and Dummy emit
l2_swimlane_aicpu_record_sched_phase with phase_start_shared for both endpoints
and never refresh it afterward. Update the phase emission logic in
scheduler_dispatch.cpp around the Complete/Wire/Dummy blocks so
phase_start_shared is advanced or re-sampled after these mutations, or
explicitly document the coarse sampling behavior if that is intended; use the
existing l2_swimlane_aicpu_record_sched_phase and phase_start_shared flow as the
anchor points.
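
One way to read the suggestion in both comments is sketched below. It is an illustration only: `phase_start_shared`, `shared_depth_at_start` and `shared_depth_at_end` are names taken from the review text, while `Queues`, `DepthSnapshot`, `PhaseRecord`, `sample_depths` and `run_phase` are hypothetical stand-ins. The thread does not show whether the merged code takes this form.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kNumShapes = 3;  // hypothetical number of shared queues
using DepthSnapshot = std::array<std::uint32_t, kNumShapes>;

// Hypothetical stand-in for the shared ready queues: only depths are modelled.
struct Queues {
    DepthSnapshot depth{};
};

// Hypothetical simplified phase record with start and end depth snapshots.
struct PhaseRecord {
    const char* phase;
    DepthSnapshot shared_depth_at_start;
    DepthSnapshot shared_depth_at_end;
};

DepthSnapshot sample_depths(const Queues& q) { return q.depth; }

// Runs one phase, samples the queues again afterwards, records both
// snapshots, and returns the fresh end snapshot so the caller can reuse it
// as the next phase's start value.
template <typename PhaseBody>
DepthSnapshot run_phase(const char* name, Queues& q,
                        const DepthSnapshot& phase_start_shared,
                        std::vector<PhaseRecord>& out, PhaseBody&& body) {
    body(q);                                // the phase may enqueue or dequeue work
    DepthSnapshot end = sample_depths(q);   // fresh sample, not the cached start
    out.push_back(PhaseRecord{name, phase_start_shared, end});
    return end;
}
```

A caller would write `phase_start_shared = run_phase("Complete", q, phase_start_shared, records, body);` so that each record's end snapshot is also the following record's start snapshot.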


📥 Commits

Reviewing files that changed from the base of the PR and between a667325 and c488a12.

📒 Files selected for processing (21)

docs/scheduler.md
simpler_setup/tools/swimlane_converter.py
src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h
src/a2a3/platform/include/common/l2_swimlane_profiling.h
src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
src/a5/platform/include/aicpu/l2_swimlane_collector_aicpu.h
src/a5/platform/include/common/l2_swimlane_profiling.h
src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
src/a5/platform/shared/host/l2_swimlane_collector.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
tools/benchmark_rounds.sh

💤 Files with no reviewable changes (2)

src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
src/a5/platform/shared/host/l2_swimlane_collector.cpp

The tensormap_and_ringbuffer scheduler carried a per-thread, per-CoreType
thread-local ready buffer (PTO2LocalReadyBuffer) as a "local-first dispatch"
fast path: a newly-ready consumer was try_push'd into the producing thread's
local buffer (peer-invisible, zero-atomic) before falling back to the shared
MPMC ready_queues[]. A spill/flush layer (capacity-gated overflow spill,
flush_local_bufs, FlushGuard) existed only to stop a thread from hoarding work
and starving peers.

Remove it on both arches. The scheduler now always routes ready tasks straight
to the shared ready_queues[] (DUMMY tasks still go to dummy_ready_queue via the
existing push_ready_routed()). Correctness never depended on the local buffer:
every consumer already had a nullptr/!try_push -> shared-queue fallback.

Changes (a2a3 + a5, mirrored):
- Delete struct PTO2LocalReadyBuffer, PTO2_LOCAL_DISPATCH_TYPE_NUM,
  LOCAL_READY_CAP_PER_TYPE, the local_bufs[] stack allocation.
- Drop the PTO2LocalReadyBuffer param from release_fanin_and_check_ready,
  get_ready_tasks_batch, on_task_complete, pop_ready_tasks_batch,
  dispatch_shape, dispatch_ready_tasks, complete_slot_task,
  check_running_cores_for_completion, has_residual_mix, poll_and_complete
  (+ DrainCompletionSink field).
- Delete the spill/flush machinery in dispatch_ready_tasks.
- a2a3 speculative-release / early-dispatch machinery is untouched
  (orthogonal).

C++ unit tests (tests/ut/cpp/{a2a3,a5}): delete the LocalReadyBufferTest suite
in test_ready_queue.cpp and rewrite test_scheduler_state.cpp's
GetReadyTasksBatch test to drain the shared queue (was: local-buffer-first);
update the file/header comments.

Profiling cleanup (platform + Python): drop local_depth_* from
L2SwimlaneAicpuSchedPhaseRecord, the local_at_* params of
l2_swimlane_aicpu_record_sched_phase, the host JSON local_at_* emit, and the
swimlane_converter.py local_at_*/local_ready_buf handling. Shared-queue depth
tracking is retained.

Fix a latent kernel bug the timing change unmasked
(examples/workers/l3/ep_dispatch_combine): dispatch wrote recv_count_out with
a raw scalar GM store and never flushed it to HBM, so the downstream
local_expert task read it as 0 under the new dispatch timing -> zero rows
processed -> all-zero output. Add a single-cache-line dcci after the
recv_count_out write. Root cause + evidence in
docs/investigations/2026-07-local-buffer-removal-ep-combine-regression.md.

Also restore the tmr benchmark set in tools/benchmark_rounds.sh:
hw-native-sys#1177 silently dropped the 6 cases hw-native-sys#1157 had just
added, leaving only alternating_matmul_add. Restored benchmark_bgemm /
paged_attention_unroll(+manual_scope) / batch_paged_attention /
qwen3_14b_decode; spmd_paged_attention is commented out pending a pre-existing
onboard stall (reproduces on baseline, unrelated).

Verified on real a2a3 silicon (task-submit locked):
- Sim: a2a3sim + a5sim tmr suites pass; ep_dispatch_combine sim passes.
- Onboard: tmr suite 33 passed / 1 skipped; ep_dispatch_combine 3/3 pass after
  the dcci fix (was 3/3 fail); l2_swimlane dfx (incl. swimlane_converter smoke
  over the new JSON) passes.
- C++ UT sources compile clean after the test updates.
- Benchmark (Effective, orch∪sched window, 100 rounds, same session, same
  locked device): all measurable tmr cases improve -4% to -18%
  (qwen3_14b_decode -4.4%, batch_paged_attention -18%).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jul 2, 2026

coderabbitai Bot reviewed Jul 2, 2026

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp Outdated

ChaoWao force-pushed the worktree-snoopy-sparking-shell branch 2 times, most recently from ff91d4a to 8354a97 Compare July 2, 2026 07:51

ChaoWao force-pushed the worktree-snoopy-sparking-shell branch from 8354a97 to 87d001d Compare July 2, 2026 08:26

ChaoWao merged commit b31253f into hw-native-sys:main Jul 2, 2026
9 of 16 checks passed

ChaoWao deleted the worktree-snoopy-sparking-shell branch July 2, 2026 09:36

ChaoWao mentioned this pull request Jul 3, 2026

Add: host_build_graph runtime (host-orchestration variant of tensormap) #1185

Open

5 tasks

coderabbitai Bot mentioned this pull request Jul 3, 2026

optimize: prewire task dependencies on orchestrator side #1263

Open
