The profiling host path currently uses several mutexes — ready_shards (mutex+cv), done_shards (mutex), the striped free_queue writer lock, recycled (per-shard×kind mutex), the dev_to_host_ mapping mutex, and the collectors' per-core/thread record-vector mutexes. Most of these guard access that is genuinely multi-writer / multi-consumer in the current design, so they are not simply removable as-is (see #1237 for the collector-output side).
But with a strict per-lane partition (1 AICPU thread ↔ 1 drain ↔ 1 collector ↔ 1 recycled pool, + 1 replenish thread — already the shape the PR uses, sized by PLATFORM_MAX_AICPU_THREADS), the whole buffer pipeline can be made all-SPSC and lock-free, which maximizes host read throughput on the hot path.
Target pipeline (per lane q), every hop single-producer/single-consumer → lock-free (barriers only):
Sub-changes required (each with its tradeoff)
1. ready_shards → lock-free SPSC ring. Replace the cv-blocking wait with a lock-free-compatible blocking/timeout primitive that keeps the 100 ms tick and the execution_complete exit semantics. This is the main non-trivial piece.
2. done_shards → single consumer. Remove obtain_buffer's synchronous done-drain so that replenish is the only done→recycled path (done[q] becomes a collector[q]→replenish SPSC queue). Tradeoff: drain can no longer self-harvest done; it relies on replenish keeping recycled warm.
3. recycled → per-lane SPSC. Remove pop_recycled_any (cross-shard borrow) so recycled[q] has exactly one writer (replenish) and one reader (drain q). Cross-lane balancing moves into replenish instead (see below).
4. free_queue → single writer (drain[owner]). Depends on the invariant below.
5. alloc: off the hot path, on replenish, batched and proactive. Move alloc_and_register out of the drain hot path to the replenish thread. Allocate one big block, register it once, then carve it into N buffers; this amortizes the expensive HAL registration compared with N separate registration calls. Drive allocation by a watermark so the drain path is pure-pop and never allocates. Single-thread allocator ⇒ no lock. Implementation consequence: resolve_host_ptr must handle sub-buffers of a block, either by registering the block once and adding N offset-computed dev_to_host_ entries or by switching to range-based resolution.
6. init: seed each shard's recycled evenly. Today init pushes all surplus buffers with the default shard=0, so the whole surplus lands in recycled[0]; the current design tolerates that only because pop_recycled_any redistributes. Removing cross-shard borrow (sub-change 3) requires init to distribute the surplus across shards.
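Each hop in the list above becomes a single-producer/single-consumer queue, so a bounded ring with one atomic index per side is enough. The following is a minimal sketch of that technique, not the repository's implementation: the class name SpscRing, its methods, and the power-of-two capacity are all assumptions made for illustration, and the blocking/timeout wait for ready_shards is deliberately left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical bounded SPSC ring (illustration only, not the project's type).
// Exactly one thread may call try_push and exactly one thread may call try_pop.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side. Returns false when the ring is full.
    bool try_push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire pairs with the consumer's release store to tail_, so the
        // slot we are about to overwrite has already been read.
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) {
            return false;
        }
        slots_[head & (Capacity - 1)] = value;
        // Release publishes the slot write before the new head is visible.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns std::nullopt when the ring is empty.
    std::optional<T> try_pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        // Acquire pairs with the producer's release store to head_.
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) {
            return std::nullopt;
        }
        T value = slots_[tail & (Capacity - 1)];
        // Release hands the slot back to the producer.
        tail_.store(tail + 1, std::memory_order_release);
        return value;
    }

private:
    T slots_[Capacity];
    // Separate cache lines (64 bytes assumed) to limit false sharing.
    alignas(64) std::atomic<std::size_t> head_{0};  // written by producer only
    alignas(64) std::atomic<std::size_t> tail_{0};  // written by consumer only
};
```

The acquire/release pairs are the "barriers only" part: no mutex is taken on either side. Correctness depends entirely on there being one producer and one consumer per ring, which is why the static-ownership invariant matters for free_queue.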
Hard invariant this depends on
Core/instance → AICPU-thread ownership must be static (no migration / work-stealing). Verified in the current code: both runtimes call assign_cores_to_threads() once at init (cluster-aligned round-robin, cluster ci → thread ci % N), each thread only completes/flushes its own cores, and sched/orch pools enqueue via a fixed thread index — so each free_queue is refilled by exactly one drain thread. If future work introduces core migration/stealing, freeQ becomes multi-writer and this design breaks (it would need re-synchronization or a producer hand-off).
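One way to make this invariant explicit is to build the owner table once and check it wherever a thread refills a free_queue. The sketch below assumes only the cluster ci → thread ci % N rule quoted above; the function names owner_thread_of_cluster, build_owner_table, and is_sole_writer are hypothetical and are not the repository's assign_cores_to_threads or find_core_owner_thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the round-robin rule "cluster ci -> thread ci % N".
inline std::size_t owner_thread_of_cluster(std::size_t cluster_index,
                                           std::size_t num_threads) {
    assert(num_threads > 0);
    return cluster_index % num_threads;
}

// Build the cluster -> owner table once at init. It is never mutated
// afterwards, which is what "static ownership" means here.
inline std::vector<std::size_t> build_owner_table(std::size_t num_clusters,
                                                  std::size_t num_threads) {
    std::vector<std::size_t> owner(num_clusters);
    for (std::size_t ci = 0; ci < num_clusters; ++ci) {
        owner[ci] = owner_thread_of_cluster(ci, num_threads);
    }
    return owner;
}

// Debug check a drain thread could run before refilling a free_queue:
// true only if `thread_index` is the single owner of `cluster_index`.
inline bool is_sole_writer(const std::vector<std::size_t>& owner,
                           std::size_t cluster_index,
                           std::size_t thread_index) {
    return cluster_index < owner.size() && owner[cluster_index] == thread_index;
}
```

With 5 clusters and 3 threads the table is {0, 1, 2, 0, 1}. A check like is_sole_writer in debug builds would flag a future migration or work-stealing path before it silently turns a free_queue into a multi-writer queue.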
Compatible future extension: replenish cold→hot buffer balancing by routing done[cold] → recycled[hot] stays SPSC — replenish remains the sole writer of every recycled pool and sole consumer of every done shard, so it only changes the delivery target, not the number of readers/writers.
Open question (confirm before relying on lock-free freeQ)
The PR kept a striped free_queue writer lock citing "some runtime paths can reassign producer ownership across AICPU threads." A repo-wide search found no reassign/steal/rebalance/migrate path in either runtime, and orch_to_sched is not present in the code — so the lock looks defensive/forward-looking. Confirm with the author what that comment actually refers to; if a real (even rare) migration path exists, freeQ cannot be made lock-free without handling the hand-off.
Proposed Fix
Rework the buffer pool into the all-SPSC per-lane pipeline above: lock-free SPSC ready/done/recycled queues + a lock-free wait primitive for ready, replenish as the sole done→recycled mover and sole (batched, proactive, block-carving) allocator, even init seeding, and the static-ownership invariant made explicit. Sizeable enough to warrant a short design note / RFC first. Device memory-ordering change → must be onboard-validated (sim does not exercise weak-memory reordering).
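The block-carving allocator changes what resolve_host_ptr has to do, and the range-based option can be sketched as follows. This is an illustration under assumptions rather than the project's code: RangeResolver, add_block, resolve, and carve are hypothetical names, addresses are modelled as plain integers, and a return value of 0 is an assumed "not found" convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical range-based device -> host resolver: one entry per registered
// block instead of one entry per buffer.
class RangeResolver {
public:
    // Record a block that was registered once with the (assumed) HAL.
    void add_block(std::uintptr_t dev_base, std::uintptr_t host_base,
                   std::size_t size) {
        blocks_[dev_base] = Block{host_base, size};
    }

    // Translate any device address inside a registered block to its host
    // address. Returns 0 (assumed convention) when no block contains it.
    std::uintptr_t resolve(std::uintptr_t dev_ptr) const {
        auto it = blocks_.upper_bound(dev_ptr);  // first block starting after dev_ptr
        if (it == blocks_.begin()) {
            return 0;
        }
        --it;  // last block starting at or before dev_ptr
        const std::uintptr_t offset = dev_ptr - it->first;
        if (offset >= it->second.size) {
            return 0;
        }
        return it->second.host_base + offset;
    }

private:
    struct Block {
        std::uintptr_t host_base;
        std::size_t size;
    };
    std::map<std::uintptr_t, Block> blocks_;  // keyed by device base address
};

// Carve one registered block into `count` equally sized buffers.
inline std::vector<std::uintptr_t> carve(std::uintptr_t dev_base,
                                         std::size_t buffer_size,
                                         std::size_t count) {
    std::vector<std::uintptr_t> buffers;
    buffers.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        buffers.push_back(dev_base + i * buffer_size);
    }
    return buffers;
}
```

If only the replenish thread calls add_block while other threads call resolve, the map still needs synchronization or a publish step; the sketch shows only the lookup shape. The alternative from sub-change 5, adding N offset-computed dev_to_host_ entries, keeps the existing exact-match lookup at the cost of N map insertions per block.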
Priority
Medium (minor risk, should fix in next few releases)
Related: #1162 (introduced the sharding this builds on), #1237 (collector-output sharding — shares the static-ownership linchpin and would be subsumed), #1247 (device-side plumbing unification), #997 (backpressure runs on top of these queues).
Category
Technical Debt (cleanup, refactor)
Component
Platform (a2a3 / a2a3sim)
Location
(Symbols only.)
Host framework: src/common/platform/include/host/buffer_pool_manager.h (ready/done/recycled shards, with_free_queue_writer, obtain_buffer, pop_recycled_any, alloc_and_register, resolve_host_ptr), src/common/platform/include/host/profiler_base.h (mgmt_drain_loop / mgmt_replenish_loop / poll_and_collect_loop, try_push_to_free_queue).
Device seeding: each collector's initialize() recycled seeding.
Ownership invariant: assign_cores_to_threads / find_core_owner_thread in both runtimes.