refactor(tmr): remove PTO2LocalReadyBuffer local-first dispatch#1245
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR removes the scheduler's per-thread local ready-buffer optimization ( ChangesLocal Ready-Buffer Removal and Shared-Queue Profiling
Estimated code review effort: 4 (Complex) | ~60 minutes Benchmark Configuration Expansion
Estimated code review effort: 1 (Trivial) | ~5 minutes Sequence Diagram(s)sequenceDiagram
participant resolve_and_dispatch
participant SchedulerContext
participant ready_queues
participant CoreTracker
resolve_and_dispatch->>SchedulerContext: check_running_cores_for_completion(no local_bufs)
SchedulerContext->>ready_queues: on_task_complete triggers push_ready_routed
resolve_and_dispatch->>SchedulerContext: dispatch_ready_tasks(no local_bufs)
SchedulerContext->>SchedulerContext: dispatch_shape(shape, tracker)
SchedulerContext->>ready_queues: pop_ready_tasks_batch(shape, out, max_count)
ready_queues-->>SchedulerContext: batch of ready tasks
SchedulerContext->>CoreTracker: dispatch tasks to cores
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request removes the thread-local ready buffer optimization (PTO2LocalReadyBuffer) and its associated profiling and swimlane tracking from the scheduler implementation across both the a2a3 and a5 platforms. The task dispatch and completion paths have been simplified to route tasks directly to the global per-shape ready queues. Additionally, several new benchmark cases have been added to the benchmark script. As there are no review comments provided, I have no feedback to offer on the review itself.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 738-752: The shared queue-depth snapshot logic is stale for phases
that enqueue work: in the dispatch/emit path, phases like Complete, Wire, and
Dummy currently reuse phase_start_shared for both start and end, so
shared_depth_at_end and the next phase’s start value can lag behind the actual
queue state. Update the emit flow to take a fresh end snapshot after any phase
that mutates shared queues, then assign that result back into phase_start_shared
before the next phase runs. Apply the same fix consistently anywhere this
pattern is used so the cached shared-depth sampling stays correct.
In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 653-662: The shared-depth snapshots used by queue-mutating phase
records are stale because Complete, Wire, and Dummy emit
l2_swimlane_aicpu_record_sched_phase with phase_start_shared for both endpoints
and never refresh it afterward. Update the phase emission logic in
scheduler_dispatch.cpp around the Complete/Wire/Dummy blocks so
phase_start_shared is advanced or re-sampled after these mutations, or
explicitly document the coarse sampling behavior if that is intended; use the
existing l2_swimlane_aicpu_record_sched_phase and phase_start_shared flow as the
anchor points.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 572515a7-ffe6-4e2c-8e30-f59126cebdcf
📒 Files selected for processing (21)
docs/scheduler.mdsimpler_setup/tools/swimlane_converter.pysrc/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.hsrc/a2a3/platform/include/common/l2_swimlane_profiling.hsrc/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cppsrc/a2a3/platform/shared/host/l2_swimlane_collector.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cppsrc/a5/platform/include/aicpu/l2_swimlane_collector_aicpu.hsrc/a5/platform/include/common/l2_swimlane_profiling.hsrc/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cppsrc/a5/platform/shared/host/l2_swimlane_collector.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpptools/benchmark_rounds.sh
💤 Files with no reviewable changes (2)
- src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
- src/a5/platform/shared/host/l2_swimlane_collector.cpp
ff91d4a to
8354a97
Compare
The tensormap_and_ringbuffer scheduler carried a per-thread, per-CoreType
thread-local ready buffer (PTO2LocalReadyBuffer) as a "local-first dispatch"
fast path: a newly-ready consumer was try_push'd into the producing thread's
local buffer (peer-invisible, zero-atomic) before falling back to the shared
MPMC ready_queues[]. A spill/flush layer (capacity-gated overflow spill,
flush_local_bufs, FlushGuard) existed only to stop a thread from hoarding work
and starving peers.
Remove it on both arches. The scheduler now always routes ready tasks straight
to the shared ready_queues[] (DUMMY tasks still go to dummy_ready_queue via the
existing push_ready_routed()). Correctness never depended on the local buffer —
every consumer already had a nullptr/!try_push -> shared-queue fallback.
Changes (a2a3 + a5, mirrored):
- Delete struct PTO2LocalReadyBuffer, PTO2_LOCAL_DISPATCH_TYPE_NUM,
LOCAL_READY_CAP_PER_TYPE, the local_bufs[] stack allocation.
- Drop the PTO2LocalReadyBuffer param from release_fanin_and_check_ready,
get_ready_tasks_batch, on_task_complete, pop_ready_tasks_batch, dispatch_shape,
dispatch_ready_tasks, complete_slot_task, check_running_cores_for_completion,
has_residual_mix, poll_and_complete (+ DrainCompletionSink field).
- Delete the spill/flush machinery in dispatch_ready_tasks.
- a2a3 speculative-release / early-dispatch machinery is untouched (orthogonal).
C++ unit tests (tests/ut/cpp/{a2a3,a5}): delete the LocalReadyBufferTest suite
in test_ready_queue.cpp and rewrite test_scheduler_state.cpp's
GetReadyTasksBatch test to drain the shared queue (was: local-buffer-first);
update the file/header comments.
Profiling cleanup (platform + Python): drop local_depth_* from
L2SwimlaneAicpuSchedPhaseRecord, the local_at_* params of
l2_swimlane_aicpu_record_sched_phase, the host JSON local_at_* emit, and the
swimlane_converter.py local_at_*/local_ready_buf handling. Shared-queue depth
tracking is retained.
Fix a latent kernel bug the timing change unmasked (examples/workers/l3/
ep_dispatch_combine): dispatch wrote recv_count_out with a raw scalar GM store
and never flushed it to HBM, so the downstream local_expert task read it as 0
under the new dispatch timing -> zero rows processed -> all-zero output. Add a
single-cache-line dcci after the recv_count_out write. Root cause + evidence in
docs/investigations/2026-07-local-buffer-removal-ep-combine-regression.md.
Also restore the tmr benchmark set in tools/benchmark_rounds.sh: hw-native-sys#1177 silently
dropped the 6 cases hw-native-sys#1157 had just added, leaving only alternating_matmul_add.
Restored benchmark_bgemm / paged_attention_unroll(+manual_scope) /
batch_paged_attention / qwen3_14b_decode; spmd_paged_attention is commented out
pending a pre-existing onboard stall (reproduces on baseline, unrelated).
Verified on real a2a3 silicon (task-submit locked):
- Sim: a2a3sim + a5sim tmr suites pass; ep_dispatch_combine sim passes.
- Onboard: tmr suite 33 passed / 1 skipped; ep_dispatch_combine 3/3 pass after
the dcci fix (was 3/3 fail); l2_swimlane dfx (incl. swimlane_converter smoke
over the new JSON) passes.
- C++ UT sources compile clean after the test updates.
- Benchmark (Effective, orch∪sched window, 100 rounds, same session, same
locked device): all measurable tmr cases improve -4% to -18%
(qwen3_14b_decode -4.4%, batch_paged_attention -18%).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8354a97 to
87d001d
Compare
Summary
PTO2LocalReadyBufferlocal-first dispatch optimization from thetensormap_and_ringbuffer scheduler on both a2a3 and a5. Ready tasks now go
straight to the shared MPMC
ready_queues[](DUMMY →dummy_ready_queueviathe existing
push_ready_routed()). Delete the struct, its constants, thelocal_bufs[]allocation, the spill/flush machinery, and drop the param fromall 10 affected signatures. a2a3 speculative-release / early-dispatch is
untouched (orthogonal).
local_depth_*fromL2SwimlaneAicpuSchedPhaseRecord, thelocal_at_*params ofl2_swimlane_aicpu_record_sched_phase, the host JSON emit, and theswimlane_converter.pyhandling. Shared-queue depth tracking retained.tests/ut/cpp/{a2a3,a5}): removeLocalReadyBufferTest, rewrite theget_ready_tasks_batchtest to drain the shared queue.examples/workers/l3/ep_dispatch_combine: dispatch wroterecv_count_outwitha raw scalar GM store and never flushed it to HBM, so under the new dispatch
timing the downstream
local_expertread it as 0 → zero rows → all-zerooutput. Add a single-cache-line
dcciafter the write. Full root-cause indocs/investigations/2026-07-local-buffer-removal-ep-combine-regression.md.tools/benchmark_rounds.sh(feat(dfx): [STRACE] host/device timing markers; single source of truth #1177 silentlydropped the 6 cases bench: add qwen3_14b_decode to tensormap_and_ringbuffer benchmark set #1157 added).
spmd_paged_attentioncommented out pending apre-existing onboard stall (reproduces on baseline, unrelated).
Performance (Effective = orch∪sched window, 100 rounds, same locked a2a3 device)
Removing the local buffer makes newly-ready tasks immediately visible to all
scheduler threads, improving load balance.
Testing
Note: pre-existing failure (not caused by this PR)
spmd_paged_attentionfails onboard with507018 S1:running-stalled(aforward-progress stall, zero deadlock/capacity detectors). Verified it reproduces
identically on the merge-base baseline (local buffer present), so it is
pre-existing and unrelated. Commented out in the benchmark set.
🤖 Generated with Claude Code