[Code Health] Redesign profiling buffer-pool as an all-SPSC lock-free per-lane pipeline (replenish-driven alloc/recycle)

### Category

Technical Debt (cleanup, refactor)

### Component

Platform (a2a3 / a2a3sim)

### Description

The profiling host path currently uses several mutexes — `ready_shards` (mutex+cv), `done_shards` (mutex), the striped `free_queue` writer lock, `recycled` (per-shard×kind mutex), the `dev_to_host_` mapping mutex, and the collectors' per-core/thread record-vector mutexes. Most of these guard access that is genuinely multi-writer / multi-consumer **in the current design**, so they are not simply removable as-is (see #1237 for the collector-output side).

But with a strict **per-lane** partition (1 AICPU thread ↔ 1 drain ↔ 1 collector ↔ 1 recycled pool, + 1 replenish thread — already the shape the PR uses, sized by `PLATFORM_MAX_AICPU_THREADS`), the whole buffer pipeline can be made **all-SPSC and lock-free**, which maximizes host read throughput on the hot path.

**Target pipeline (per lane q), every hop single-producer/single-consumer → lock-free (barriers only):**

```
freeQ[q] → AICPU q writes → readyQ[q] → drain q → ready_shard[q] → collector q
  → done_shard[q] → replenish → recycled[q] → drain q refills freeQ[q]
```

- readyQ[q]: AICPU q → drain q
- ready_shard[q]: drain q → collector q
- done_shard[q]: collector q → **replenish (sole consumer)**
- recycled[q]: **replenish (sole writer)** → drain q
- freeQ[q]: **drain q (sole writer)** → AICPU q
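
For the host-to-host hops (`ready_shard`, `done_shard`, `recycled`), a bounded SPSC ring is one way to get "barriers only". The sketch below is a minimal illustration, not the existing implementation: `SpscRing`, `try_push`, and `try_pop` are hypothetical names, and it assumes exactly one producer thread and one consumer thread per ring. Queues shared with the device (`freeQ`, `readyQ`) have extra visibility requirements that this host-only sketch does not cover.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

// Hypothetical bounded SPSC ring (illustrative name and API).
// Capacity must be a power of two so index wrap is a cheap mask.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer thread only. Returns false when the ring is full.
    bool try_push(T value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        // Acquire pairs with the consumer's release store to head_, so the
        // slot we are about to overwrite has already been moved out.
        const std::size_t head = head_.load(std::memory_order_acquire);
        assert(tail - head <= Capacity);
        if (tail - head == Capacity) return false;
        slots_[tail & (Capacity - 1)] = std::move(value);
        tail_.store(tail + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer thread only. Returns std::nullopt when the ring is empty.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire pairs with the producer's release store to tail_.
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;
        T value = std::move(slots_[head & (Capacity - 1)]);
        head_.store(head + 1, std::memory_order_release);  // hand the slot back
        return value;
    }

private:
    T slots_[Capacity]{};
    // Separate cache lines so producer and consumer do not false-share.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

Each index has a single writer, so no compare-exchange is needed. The design only stays correct while the one-producer/one-consumer assumption holds for every hop.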

### Sub-changes required (each with its tradeoff)

1. **ready_shards → lock-free SPSC ring** + replace the cv-blocking wait with a lock-free-compatible blocking/timeout primitive (must keep the 100 ms tick + `execution_complete` exit semantics). This is the main non-trivial piece.
2. **done_shards → single consumer.** Remove `obtain_buffer`'s synchronous done-drain so replenish is the only done→recycled path (done[q] becomes collector[q]→replenish SPSC). Tradeoff: drain can no longer self-harvest done; it relies on replenish keeping recycled warm.
3. **recycled → per-lane SPSC.** Remove `pop_recycled_any` (cross-shard borrow) so recycled[q] has exactly one writer (replenish) and one reader (drain q). Cross-lane balancing moves into replenish instead (see below).
4. **free_queue → single writer** (drain[owner]). Depends on the invariant below.
5. **alloc off the hot path, on replenish, batched + proactive.** Move `alloc_and_register` out of the drain hot path to the replenish thread; allocate **one big block and register it once, then carve it into N buffers** (amortizes the expensive HAL registration vs N separate reg calls); drive it by a watermark so the drain path is pure-pop and never allocs. Single-thread allocator ⇒ no lock. Implementation consequence: `resolve_host_ptr` must handle sub-buffers of a block — register the block once and add N offset-computed `dev_to_host_` entries, or switch to range-based resolution.
6. **init: seed each shard's recycled evenly.** Today init pushes all surplus buffers with the default `shard=0`, so the whole surplus lands in `recycled[0]`; the current design tolerates that only because `pop_recycled_any` redistributes. Removing cross-shard borrow (sub-change 3) requires init to distribute the surplus across shards.
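
Sub-changes 5 and 6 can be sketched together: register one block, carve it into fixed-size buffers, resolve any device pointer by range lookup, and deal the carved buffers round-robin across lanes. Everything here is an assumption for illustration (`RegisteredBlock`, `RangeResolver`, `carve`, `seed_evenly` are hypothetical names, and the real HAL registration call is not shown). The map is only safe with a single thread mutating and reading it; if drain threads call `resolve_host_ptr` while replenish adds blocks, a publication scheme is still needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

// Hypothetical description of one block that was registered once.
struct RegisteredBlock {
    std::uint64_t dev_base;   // device address of the block start
    std::byte*    host_base;  // host mapping of the block start
    std::size_t   size;       // total bytes in the block
};

// Range-based alternative to N per-buffer dev_to_host_ entries.
class RangeResolver {
public:
    void add_block(const RegisteredBlock& b) { blocks_[b.dev_base] = b; }

    // Find the block containing dev_ptr, then apply the same offset on the host side.
    std::byte* resolve_host_ptr(std::uint64_t dev_ptr) const {
        auto it = blocks_.upper_bound(dev_ptr);  // first block starting after dev_ptr
        if (it == blocks_.begin()) return nullptr;
        const RegisteredBlock& b = std::prev(it)->second;
        const std::uint64_t off = dev_ptr - b.dev_base;
        return off < b.size ? b.host_base + off : nullptr;
    }

private:
    std::map<std::uint64_t, RegisteredBlock> blocks_;  // keyed by dev_base
};

// Carve a block into as many buf_size buffers as fit; returns device addresses.
inline std::vector<std::uint64_t> carve(const RegisteredBlock& b, std::size_t buf_size) {
    assert(buf_size > 0);
    std::vector<std::uint64_t> out;
    for (std::size_t off = 0; off + buf_size <= b.size; off += buf_size)
        out.push_back(b.dev_base + off);
    return out;
}

// Even init seeding: deal buffers round-robin so no lane starts empty.
inline std::vector<std::vector<std::uint64_t>> seed_evenly(
        const std::vector<std::uint64_t>& bufs, std::size_t lanes) {
    assert(lanes > 0);
    std::vector<std::vector<std::uint64_t>> per_lane(lanes);
    for (std::size_t i = 0; i < bufs.size(); ++i)
        per_lane[i % lanes].push_back(bufs[i]);
    return per_lane;
}
```

Range lookup costs O(log B) in the number of blocks rather than needing one map entry per buffer. Which of the two options from sub-change 5 is cheaper in practice would need measuring.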

### Hard invariant this depends on

**Core/instance → AICPU-thread ownership must be static (no migration / work-stealing).** Verified in the current code: both runtimes call `assign_cores_to_threads()` once at init (cluster-aligned round-robin, `cluster ci → thread ci % N`), each thread only completes/flushes its own cores, and sched/orch pools enqueue via a fixed thread index — so each free_queue is refilled by exactly one drain thread. If future work introduces core migration/stealing, freeQ becomes multi-writer and this design breaks (it would need re-synchronization or a producer hand-off).
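
One way to make the invariant explicit is a debug-only guard on each `freeQ` writer path. The sketch below is an assumption, not existing code: `owner_thread` restates the `ci % N` mapping described above, and `SingleWriterGuard` is a hypothetical helper that latches the first writer thread and reports any later writer from a different thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Static mapping as described above: cluster ci is owned by thread ci % n_threads.
inline std::size_t owner_thread(std::size_t cluster, std::size_t n_threads) {
    assert(n_threads > 0);
    return cluster % n_threads;
}

// Hypothetical debug guard for a queue that must have exactly one writer.
class SingleWriterGuard {
public:
    // Returns true if the caller is (or just became) the recorded sole writer,
    // false if a different thread has already written.
    bool check() {
        const std::thread::id me = std::this_thread::get_id();
        std::thread::id expected{};  // default id means "no writer recorded yet"
        if (owner_.compare_exchange_strong(expected, me, std::memory_order_relaxed))
            return true;
        return expected == me;  // on failure, expected holds the recorded writer
    }

private:
    std::atomic<std::thread::id> owner_{std::thread::id{}};
};
```

Calling `assert(guard.check())` at the top of the free-queue push in debug builds would turn a future migration or work-stealing change into an immediate failure instead of a silent data race.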

**Compatible** future extension: replenish cold→hot buffer balancing by routing `done[cold] → recycled[hot]` stays SPSC — replenish remains the sole writer of every recycled pool and sole consumer of every done shard, so it only changes the delivery target, not the number of readers/writers.
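
A minimal sketch of such a routing decision, under assumptions that are not in the current code: replenish keeps an approximate depth per recycled pool, and a low watermark (a hypothetical tuning parameter) decides when a finished buffer is redirected to the neediest lane instead of going back to its origin.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical policy: pick the lane whose recycled pool should receive a
// buffer that just came off done[origin_lane]. Only the replenish thread
// calls this, so every recycled pool still has exactly one writer.
inline std::size_t pick_recycled_target(std::size_t origin_lane,
                                        const std::vector<std::size_t>& recycled_depth,
                                        std::size_t low_watermark) {
    assert(origin_lane < recycled_depth.size());
    // Lane with the fewest spare buffers (first one wins on ties).
    const auto neediest = std::min_element(recycled_depth.begin(), recycled_depth.end());
    if (*neediest < low_watermark)
        return static_cast<std::size_t>(neediest - recycled_depth.begin());
    return origin_lane;  // nobody is starved: return the buffer home
}
```

The function changes only the delivery target. The reader and writer of each queue are unchanged, which is why the extension stays SPSC.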

### Open question (confirm before relying on lock-free freeQ)

The PR kept a striped `free_queue` writer lock citing *"some runtime paths can reassign producer ownership across AICPU threads."* A repo-wide search found **no** reassign/steal/rebalance/migrate path in either runtime, and `orch_to_sched` is not present in the code — so the lock looks defensive/forward-looking. Confirm with the author what that comment actually refers to; if a real (even rare) migration path exists, freeQ cannot be made lock-free without handling the hand-off.

### Location

(Symbols only.) Host framework: `src/common/platform/include/host/buffer_pool_manager.h` (ready/done/recycled shards, `with_free_queue_writer`, `obtain_buffer`, `pop_recycled_any`, `alloc_and_register`, `resolve_host_ptr`), `src/common/platform/include/host/profiler_base.h` (`mgmt_drain_loop` / `mgmt_replenish_loop` / `poll_and_collect_loop`, `try_push_to_free_queue`). Device seeding: the recycled-pool seeding in each collector's `initialize()`. Ownership invariant: `assign_cores_to_threads` / `find_core_owner_thread` in both runtimes.

### Proposed Fix

Rework the buffer pool into the all-SPSC per-lane pipeline above: lock-free SPSC ready/done/recycled queues + a lock-free wait primitive for ready, replenish as the sole done→recycled mover and sole (batched, proactive, block-carving) allocator, even init seeding, and the static-ownership invariant made explicit. The change is sizeable enough to warrant a short design note / RFC first. Because it changes device memory ordering, it must be validated onboard; the simulator does not exercise weak-memory reordering.
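
For the wait primitive on `ready`, one portable option is bounded-backoff polling on top of a non-blocking pop. The sketch below is an assumption about how the tick and exit semantics could be kept, not the current behaviour: `wait_pop`, `WaitResult`, and the 1 µs to 500 µs backoff range are all hypothetical, and a futex- or eventfd-based wait would be an alternative if the added wake-up latency matters.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <optional>
#include <thread>
#include <utility>

enum class WaitResult { Item, Tick, Done };

// try_pop: any callable returning std::optional<T> without blocking.
// Returns Item with `out` filled, Tick when `tick` elapsed with nothing to
// pop, or Done when execution_complete is set and the queue is drained.
template <typename T, typename TryPop>
WaitResult wait_pop(TryPop&& try_pop,
                    const std::atomic<bool>& execution_complete,
                    std::chrono::milliseconds tick,
                    T& out) {
    using clock = std::chrono::steady_clock;
    assert(tick.count() > 0);
    const auto deadline = clock::now() + tick;
    auto nap = std::chrono::microseconds(1);
    const auto max_nap = std::chrono::microseconds(500);
    for (;;) {
        if (auto v = try_pop()) { out = std::move(*v); return WaitResult::Item; }
        if (execution_complete.load(std::memory_order_acquire)) {
            // Re-check once so an item published just before the flag is not lost.
            if (auto v = try_pop()) { out = std::move(*v); return WaitResult::Item; }
            return WaitResult::Done;
        }
        if (clock::now() >= deadline) return WaitResult::Tick;
        std::this_thread::sleep_for(nap);   // yield the core while idle
        nap = std::min(nap * 2, max_nap);   // exponential backoff, capped
    }
}
```

A collector loop would call this with a 100 ms tick, run its periodic work on `Tick`, and exit on `Done`. The producer side needs no notify call, so the push path remains a plain release store.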

### Priority

Medium (minor risk, should fix in next few releases)

Related: #1162 (introduced the sharding this builds on), #1237 (collector-output sharding — shares the static-ownership linchpin and would be subsumed), #1247 (device-side plumbing unification), #997 (backpressure runs on top of these queues).
