refactor(tmr): remove PTO2LocalReadyBuffer local-first dispatch #1245

Merged: ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:worktree-snoopy-sparking-shell on Jul 2, 2026
@ChaoWao (Collaborator) commented Jul 2, 2026

Summary

  • Remove the PTO2LocalReadyBuffer local-first dispatch optimization from the
    tensormap_and_ringbuffer scheduler on both a2a3 and a5. Ready tasks now go
    straight to the shared MPMC ready_queues[] (DUMMY → dummy_ready_queue via
    the existing push_ready_routed()). Delete the struct, its constants, the
    local_bufs[] allocation, the spill/flush machinery, and drop the param from
    all 10 affected signatures. a2a3 speculative-release / early-dispatch is
    untouched (orthogonal).
  • Profiling cleanup (platform + Python): drop local_depth_* from
    L2SwimlaneAicpuSchedPhaseRecord, the local_at_* params of
    l2_swimlane_aicpu_record_sched_phase, the host JSON emit, and the
    swimlane_converter.py handling. Shared-queue depth tracking retained.
  • C++ UT (tests/ut/cpp/{a2a3,a5}): remove LocalReadyBufferTest, rewrite the
    get_ready_tasks_batch test to drain the shared queue.
  • Fix a latent kernel bug the timing change unmasked in
    examples/workers/l3/ep_dispatch_combine: dispatch wrote recv_count_out with
    a raw scalar GM store and never flushed it to HBM, so under the new dispatch
    timing the downstream local_expert read it as 0 → zero rows → all-zero
    output. Add a single-cache-line dcci after the write. Full root-cause in
    docs/investigations/2026-07-local-buffer-removal-ep-combine-regression.md.
  • Restore the tmr benchmark set in tools/benchmark_rounds.sh: #1177 (feat(dfx):
    [STRACE] host/device timing markers; single source of truth) silently dropped
    the 6 cases that #1157 (bench: add qwen3_14b_decode to
    tensormap_and_ringbuffer benchmark set) added. spmd_paged_attention is
    commented out pending a pre-existing onboard stall (reproduces on baseline,
    unrelated).
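
The routing described in the first bullet can be sketched as follows. This is a minimal illustration, not the repository's code: `ToyScheduler`, `SharedQueue`, the `Shape` enum values, and the integer task ids are all hypothetical, and a mutex-guarded deque stands in for the real MPMC queue.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical task shapes; the PR mentions MIX/AIC/AIV gating and DUMMY tasks.
enum class Shape : std::size_t { AIC = 0, AIV = 1, MIX = 2, DUMMY = 3 };
constexpr std::size_t kNumShapes = 3;  // shapes that own a shared ready queue

// Stand-in for a shared MPMC queue: mutex-guarded, not lock-free.
class SharedQueue {
public:
    void push(int task_id) {
        std::lock_guard<std::mutex> lock(mu_);
        q_.push_back(task_id);
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mu_);
        return q_.size();
    }

private:
    mutable std::mutex mu_;
    std::deque<int> q_;
};

struct ToyScheduler {
    std::array<SharedQueue, kNumShapes> ready_queues;
    SharedQueue dummy_ready_queue;

    // Every ready task goes straight to a shared queue; no per-thread
    // buffer sits in front of it.
    void push_ready_routed(Shape shape, int task_id) {
        if (shape == Shape::DUMMY) {
            dummy_ready_queue.push(task_id);
            return;
        }
        const auto idx = static_cast<std::size_t>(shape);
        assert(idx < kNumShapes);  // only AIC/AIV/MIX index ready_queues
        ready_queues[idx].push(task_id);
    }
};
```

Because the push lands in a queue every scheduler thread can pop from, a task becomes visible to all of them as soon as the push returns.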

Performance (Effective = orch∪sched window, 100 rounds, same locked a2a3 device)

| Example | Base (µs) | HEAD (µs) | Change |
|---|---|---|---|
| alternating_matmul_add | 786.0 | 737.4 | −6.2% |
| benchmark_bgemm | 789.3 | 673.2 | −14.7% |
| paged_attention_unroll C1 | 1207.9 | 1149.2 | −4.9% |
| paged_attention_unroll C2 | 619.0 | 573.8 | −7.3% |
| paged_attention_unroll_manual_scope C1 | 1196.6 | 1146.6 | −4.2% |
| paged_attention_unroll_manual_scope C2 | 617.4 | 566.9 | −8.2% |
| batch_paged_attention | 3807.7 | 3107.1 | −18.4% |
| qwen3_14b_decode (StressBatch16Seq3500) | 2210.0 | 2112.1 | −4.4% |

Removing the local buffer makes newly-ready tasks immediately visible to all
scheduler threads, improving load balance.
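
For readers checking the numbers: the Change column is consistent with the relative difference of HEAD against Base. The PR does not state the formula, so treat the helper below as an assumption; `percent_change` and `round1` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Relative change of HEAD versus Base, in percent.
// Negative values mean HEAD is faster (lower latency) than Base.
inline double percent_change(double base_us, double head_us) {
    assert(base_us > 0.0);  // a zero baseline makes the ratio meaningless
    return (head_us - base_us) / base_us * 100.0;
}

// Round to one decimal place, the precision used in the table.
inline double round1(double x) {
    return std::round(x * 10.0) / 10.0;
}
```

For example, `round1(percent_change(786.0, 737.4))` gives -6.2, matching the alternating_matmul_add row.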

Testing

  • Simulation: a2a3sim + a5sim tmr suites pass; ep_dispatch_combine sim passes
  • Hardware (a2a3 onboard): tmr suite 33 passed / 1 skipped; ep_dispatch_combine 3/3 pass after the dcci fix (was 3/3 fail); l2_swimlane dfx (incl. swimlane_converter smoke over the new JSON) passes
  • C++ UT sources compile clean after the test updates
  • Benchmark on real a2a3 silicon (table above)

Note: pre-existing failure (not caused by this PR)

spmd_paged_attention fails onboard with 507018 S1:running-stalled (a
forward-progress stall; no deadlock or capacity detectors fired). It reproduces
identically on the merge-base baseline (local buffer present), so it is
pre-existing and unrelated to this PR. It is commented out in the benchmark set.

🤖 Generated with Claude Code

coderabbitai Bot commented Jul 2, 2026

Walkthrough

This PR removes the scheduler's per-thread local ready-buffer optimization (PTO2LocalReadyBuffer) across both a2a3 and a5 runtime variants, routing all ready-task dispatch through shared per-shape ready queues instead. Corresponding swimlane profiling structs, APIs, JSON export, and the converter tool are updated to remove local queue-depth snapshot fields/parameters, retaining only shared-queue depth data. Documentation is updated to match. A benchmark configuration script is separately extended with new example case lists.

Changes

Local Ready-Buffer Removal and Shared-Queue Profiling

| Layer / File(s) | Summary |
|---|---|
| Swimlane profiling schema and API: src/a2a3/platform/include/common/l2_swimlane_profiling.h, src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h, src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp, src/a2a3/platform/shared/host/l2_swimlane_collector.cpp, src/a5/platform/include/common/l2_swimlane_profiling.h, src/a5/platform/include/aicpu/l2_swimlane_collector_aicpu.h, src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp, src/a5/platform/shared/host/l2_swimlane_collector.cpp, simpler_setup/tools/swimlane_converter.py, docs/scheduler.md | L2SwimlaneAicpuSchedPhaseRecord drops local_depth_at_start/end arrays (keeping shared arrays and adjusting padding), l2_swimlane_aicpu_record_sched_phase drops local_at_start/end parameters, JSON export/converter/docs are updated to only reference shared queue depth. |
| Core scheduler ready-buffer removal: src/a2a3/runtime/.../pto_scheduler.h, src/a2a3/runtime/.../pto_async_wait.h, src/a5/runtime/.../pto_scheduler.h, src/a5/runtime/.../pto_async_wait.h | PTO2LocalReadyBuffer type/constant and forward declarations are removed; release_fanin_and_check_ready, get_ready_tasks_batch, on_task_complete, and AsyncWaitList::poll_and_complete no longer take local buffer parameters and route directly via push_ready_routed/global ready_queues. |
| SchedulerContext wiring: src/a2a3/runtime/.../scheduler_context.h, src/a2a3/runtime/.../scheduler_completion.cpp, src/a5/runtime/.../scheduler_context.h, src/a5/runtime/.../scheduler_completion.cpp | pop_ready_tasks_batch, dispatch_shape, dispatch_ready_tasks, complete_slot_task, and check_running_cores_for_completion drop local_bufs parameters; has_residual_mix now checks only the global MIX ready queue. |
| Dispatch loop and shared-only phase profiling: src/a2a3/runtime/.../scheduler_dispatch.cpp, src/a5/runtime/.../scheduler_dispatch.cpp | Local buffer allocation, spill/flush, and residual tracking are removed from resolve_and_dispatch; MIX/AIC/AIV gating uses shared queues; swimlane phase records (Complete, Wire, Dummy, Dispatch) capture and emit only shared queue-depth snapshots via capture_phase_end. |

Estimated code review effort: 4 (Complex) | ~60 minutes

Benchmark Configuration Expansion

| Layer / File(s) | Summary |
|---|---|
| Benchmark case list expansion: tools/benchmark_rounds.sh | Adds new TMR_EXAMPLE_CASES entries for benchmark_bgemm, paged_attention_unroll variants, batch_paged_attention, updates qwen3_14b_decode, and extends TMR_EXAMPLE_ORDER accordingly. |

Estimated code review effort: 1 (Trivial) | ~5 minutes

Sequence Diagram(s)

sequenceDiagram
  participant resolve_and_dispatch
  participant SchedulerContext
  participant ready_queues
  participant CoreTracker
  resolve_and_dispatch->>SchedulerContext: check_running_cores_for_completion(no local_bufs)
  SchedulerContext->>ready_queues: on_task_complete triggers push_ready_routed
  resolve_and_dispatch->>SchedulerContext: dispatch_ready_tasks(no local_bufs)
  SchedulerContext->>SchedulerContext: dispatch_shape(shape, tracker)
  SchedulerContext->>ready_queues: pop_ready_tasks_batch(shape, out, max_count)
  ready_queues-->>SchedulerContext: batch of ready tasks
  SchedulerContext->>CoreTracker: dispatch tasks to cores

Possibly related PRs

  • hw-native-sys/simpler#942: Both modify the AICPU scheduler-phase swimlane profiling schema/API (l2_swimlane_aicpu_record_sched_phase, L2SwimlaneAicpuSchedPhaseRecord).
  • hw-native-sys/simpler#1000: Directly related since this PR removes the local depth snapshot plumbing (local_at_*/local_depth_at_*) that #1000 had originally added.
  • hw-native-sys/simpler#989: Both touch the core scheduler dispatch/completion code paths in scheduler_dispatch.cpp and scheduler_completion.cpp.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: removing PTO2LocalReadyBuffer local-first dispatch in tmr. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The description matches the changeset, covering local ready-buffer removal, profiling cleanup, benchmark restoration, and related fixes. |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request removes the thread-local ready buffer optimization (PTO2LocalReadyBuffer) and its associated profiling and swimlane tracking from the scheduler implementation across both the a2a3 and a5 platforms. The task dispatch and completion paths have been simplified to route tasks directly to the global per-shape ready queues. Additionally, several new benchmark cases have been added to the benchmark script. As there are no review comments provided, I have no feedback to offer on the review itself.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 738-752: The shared queue-depth snapshot logic is stale for phases
that enqueue work: in the dispatch/emit path, phases like Complete, Wire, and
Dummy currently reuse phase_start_shared for both start and end, so
shared_depth_at_end and the next phase’s start value can lag behind the actual
queue state. Update the emit flow to take a fresh end snapshot after any phase
that mutates shared queues, then assign that result back into phase_start_shared
before the next phase runs. Apply the same fix consistently anywhere this
pattern is used so the cached shared-depth sampling stays correct.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 653-662: The shared-depth snapshots used by queue-mutating phase
records are stale because Complete, Wire, and Dummy emit
l2_swimlane_aicpu_record_sched_phase with phase_start_shared for both endpoints
and never refresh it afterward. Update the phase emission logic in
scheduler_dispatch.cpp around the Complete/Wire/Dummy blocks so
phase_start_shared is advanced or re-sampled after these mutations, or
explicitly document the coarse sampling behavior if that is intended; use the
existing l2_swimlane_aicpu_record_sched_phase and phase_start_shared flow as the
anchor points.
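
The stale-snapshot pattern the two comments describe, and the suggested re-sample, can be illustrated with a toy model. Everything here is hypothetical (`ToyDispatchLoop`, `ToyPhaseRecord`, `run_enqueue_phase`); only the names `phase_start_shared` and `shared_depth_at_end` are borrowed from the review text, and the real record layout is not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical record: shared queue depth at the start and end of a phase.
struct ToyPhaseRecord {
    std::size_t shared_depth_at_start;
    std::size_t shared_depth_at_end;
};

struct ToyDispatchLoop {
    std::deque<int> shared_queue;        // stand-in for a shared ready queue
    std::size_t phase_start_shared = 0;  // cached depth carried between phases
    std::vector<ToyPhaseRecord> records;

    // Run a phase that enqueues n tasks, then record it.
    // resample=false reuses the cached start value for the end (the stale
    // pattern); resample=true takes a fresh depth and carries it forward
    // as the next phase's start.
    void run_enqueue_phase(std::size_t n, bool resample) {
        for (std::size_t i = 0; i < n; ++i) shared_queue.push_back(0);
        std::size_t end_depth = phase_start_shared;
        if (resample) {
            end_depth = shared_queue.size();
        }
        // An enqueue-only phase never shrinks the queue.
        assert(end_depth >= phase_start_shared);
        records.push_back({phase_start_shared, end_depth});
        phase_start_shared = end_depth;
    }
};
```

With `resample=false`, two phases that enqueue 3 and 2 tasks both record a depth of 0 at each endpoint even though the queue holds 5 entries; with `resample=true` the records read 0→3 and 3→5.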
📥 Commits

Reviewing files that changed from the base of the PR and between a667325 and c488a12.

📒 Files selected for processing (21)
  • docs/scheduler.md
  • simpler_setup/tools/swimlane_converter.py
  • src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h
  • src/a2a3/platform/include/common/l2_swimlane_profiling.h
  • src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
  • src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/a5/platform/include/aicpu/l2_swimlane_collector_aicpu.h
  • src/a5/platform/include/common/l2_swimlane_profiling.h
  • src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
  • src/a5/platform/shared/host/l2_swimlane_collector.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • tools/benchmark_rounds.sh
💤 Files with no reviewable changes (2)
  • src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
  • src/a5/platform/shared/host/l2_swimlane_collector.cpp

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp Outdated
@ChaoWao ChaoWao force-pushed the worktree-snoopy-sparking-shell branch 2 times, most recently from ff91d4a to 8354a97 Compare July 2, 2026 07:51
The tensormap_and_ringbuffer scheduler carried a per-thread, per-CoreType
thread-local ready buffer (PTO2LocalReadyBuffer) as a "local-first dispatch"
fast path: a newly-ready consumer was try_push'd into the producing thread's
local buffer (peer-invisible, zero-atomic) before falling back to the shared
MPMC ready_queues[]. A spill/flush layer (capacity-gated overflow spill,
flush_local_bufs, FlushGuard) existed only to stop a thread from hoarding work
and starving peers.

Remove it on both arches. The scheduler now always routes ready tasks straight
to the shared ready_queues[] (DUMMY tasks still go to dummy_ready_queue via the
existing push_ready_routed()). Correctness never depended on the local buffer —
every consumer already had a nullptr/!try_push -> shared-queue fallback.
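
To make the fallback argument concrete, here is a single-threaded sketch of the two push shapes. It is an assumption-laden toy: `ToyLocalBuffer`, its capacity of 2, `push_ready_local_first`, and `push_ready_shared` are hypothetical names, and a `std::vector` stands in for the shared queue.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity per-thread buffer (stand-in for the removed struct).
struct ToyLocalBuffer {
    static constexpr std::size_t kCap = 2;
    int slots[kCap] = {};
    std::size_t count = 0;

    bool try_push(int task_id) {
        if (count == kCap) return false;  // full: caller must fall back
        slots[count++] = task_id;
        return true;
    }
};

// Stand-in for the shared queue; single-threaded here, so a vector suffices.
using ToySharedQueue = std::vector<int>;

// Old shape: try the local buffer first, fall back to the shared queue when
// there is no buffer or it is full.
inline void push_ready_local_first(ToyLocalBuffer* local, ToySharedQueue& shared,
                                   int task_id) {
    if (local != nullptr && local->try_push(task_id)) return;
    shared.push_back(task_id);
}

// New shape: the fallback branch is the only branch.
inline void push_ready_shared(ToySharedQueue& shared, int task_id) {
    assert(task_id >= 0);  // toy convention: task ids are non-negative
    shared.push_back(task_id);
}
```

Since the old path already had to handle a missing or full local buffer by pushing to the shared queue, deleting the first branch leaves behaviour every caller already tolerated.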

Changes (a2a3 + a5, mirrored):
- Delete struct PTO2LocalReadyBuffer, PTO2_LOCAL_DISPATCH_TYPE_NUM,
  LOCAL_READY_CAP_PER_TYPE, the local_bufs[] stack allocation.
- Drop the PTO2LocalReadyBuffer param from release_fanin_and_check_ready,
  get_ready_tasks_batch, on_task_complete, pop_ready_tasks_batch, dispatch_shape,
  dispatch_ready_tasks, complete_slot_task, check_running_cores_for_completion,
  has_residual_mix, poll_and_complete (+ DrainCompletionSink field).
- Delete the spill/flush machinery in dispatch_ready_tasks.
- a2a3 speculative-release / early-dispatch machinery is untouched (orthogonal).

C++ unit tests (tests/ut/cpp/{a2a3,a5}): delete the LocalReadyBufferTest suite
in test_ready_queue.cpp and rewrite test_scheduler_state.cpp's
GetReadyTasksBatch test to drain the shared queue (was: local-buffer-first);
update the file/header comments.
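
A batch drain of a shared queue, the behaviour the rewritten test now exercises, might look like the sketch below. The signature is an assumption modelled on the `pop_ready_tasks_batch(shape, out, max_count)` call in the sequence diagram; the `std::deque<int>` queue and integer task ids are stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Pop up to max_count tasks from the front of the queue into out[].
// Returns how many were popped (0 when the queue is empty).
inline std::size_t pop_ready_tasks_batch(std::deque<int>& queue, int* out,
                                         std::size_t max_count) {
    assert(out != nullptr || max_count == 0);
    std::size_t n = 0;
    while (n < max_count && !queue.empty()) {
        out[n++] = queue.front();
        queue.pop_front();
    }
    return n;
}
```

A test in this style would push a few tasks, pop in batches smaller than the queue, and check both the returned counts and the FIFO order.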

Profiling cleanup (platform + Python): drop local_depth_* from
L2SwimlaneAicpuSchedPhaseRecord, the local_at_* params of
l2_swimlane_aicpu_record_sched_phase, the host JSON local_at_* emit, and the
swimlane_converter.py local_at_*/local_ready_buf handling. Shared-queue depth
tracking is retained.

Fix a latent kernel bug the timing change unmasked (examples/workers/l3/
ep_dispatch_combine): dispatch wrote recv_count_out with a raw scalar GM store
and never flushed it to HBM, so the downstream local_expert task read it as 0
under the new dispatch timing -> zero rows processed -> all-zero output. Add a
single-cache-line dcci after the recv_count_out write. Root cause + evidence in
docs/investigations/2026-07-local-buffer-removal-ep-combine-regression.md.
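
The failure mode can be modelled on the host with a toy write-back cache. This is purely illustrative and is not the device kernel or the dcci intrinsic: `ToyCachedWord`, `flush`, and `produce_then_consume` are hypothetical, and the model has a single word where real hardware has cache lines.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of one cached word in front of backing memory.
struct ToyCachedWord {
    std::uint32_t hbm = 0;     // value visible to other cores
    std::uint32_t cached = 0;  // value held in the writer's cache
    bool dirty = false;

    // Plain scalar store: lands in the cache, not yet in backing memory.
    void store(std::uint32_t v) {
        cached = v;
        dirty = true;
    }

    // Explicit write-back of the line (the role the added flush plays).
    void flush() {
        if (dirty) {
            hbm = cached;
            dirty = false;
        }
    }

    // What a different core observes: backing memory only.
    std::uint32_t read_from_other_core() const { return hbm; }
};

// Producer writes a count; a consumer on another core then reads it.
inline std::uint32_t produce_then_consume(std::uint32_t count, bool flush_after_store) {
    assert(count > 0);  // a zero count would be indistinguishable from the stale value
    ToyCachedWord recv_count_out;
    recv_count_out.store(count);
    if (flush_after_store) recv_count_out.flush();
    return recv_count_out.read_from_other_core();
}
```

Without the flush the consumer sees 0, which in the real kernel meant zero rows processed; with it the consumer sees the written count.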

Also restore the tmr benchmark set in tools/benchmark_rounds.sh: hw-native-sys#1177 silently
dropped the 6 cases hw-native-sys#1157 had just added, leaving only alternating_matmul_add.
Restored benchmark_bgemm / paged_attention_unroll(+manual_scope) /
batch_paged_attention / qwen3_14b_decode; spmd_paged_attention is commented out
pending a pre-existing onboard stall (reproduces on baseline, unrelated).

Verified on real a2a3 silicon (task-submit locked):
- Sim: a2a3sim + a5sim tmr suites pass; ep_dispatch_combine sim passes.
- Onboard: tmr suite 33 passed / 1 skipped; ep_dispatch_combine 3/3 pass after
  the dcci fix (was 3/3 fail); l2_swimlane dfx (incl. swimlane_converter smoke
  over the new JSON) passes.
- C++ UT sources compile clean after the test updates.
- Benchmark (Effective, orch∪sched window, 100 rounds, same session, same
  locked device): all measurable tmr cases improve -4% to -18%
  (qwen3_14b_decode -4.4%, batch_paged_attention -18%).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ChaoWao ChaoWao force-pushed the worktree-snoopy-sparking-shell branch from 8354a97 to 87d001d Compare July 2, 2026 08:26
@ChaoWao ChaoWao merged commit b31253f into hw-native-sys:main Jul 2, 2026
9 of 16 checks passed
@ChaoWao ChaoWao deleted the worktree-snoopy-sparking-shell branch July 2, 2026 09:36