[Code Health] Extract the AICPU-side profiling operation layer (enqueue/pop/switch/flush/record) into a templated device engine + per-subsystem trait — symmetric to host ProfilerAlgorithms #1247

Description

@ChaoZheng109

Category

Technical Debt (cleanup, refactor)

Component

Platform (a2a3 / a2a3sim)

Description

The host side of the profiling framework is unified: ProfilerBase (poll/drain/collect loops) + BufferPoolManager (pools/queues) + ProfilerAlgorithms<Module> (the generic algorithm), one implementation for all five subsystems, each supplying a small Module trait. The AICPU (device) side has no such layer — each of L2Swimlane / PMU / DepGen / TensorDump / ScopeStats reimplements the same device-side logic in its own *_aicpu.cpp (per arch for the three arch-specific ones), 400–1000 lines each.

This issue is the device-side analog of the host unification: hoist the shared AICPU operation layer into one templated engine + per-subsystem trait. (An earlier framing scoped this to just the two low-level wait helpers; the real duplication is the whole operation layer above them.)

The operation layer is structurally identical across all five, and each copy has drifted. Every collector has an enqueue_*_ready_buffer (ready-queue push) and a switch_*_buffer (buffer rotation), and the switch skeleton is the same everywhere:

switch():
  1. null guards (state / current buffer)
  2. check free_queue for space (head==tail → drop: count dropped, reset, return)
  3. enqueue current full buffer to the ready queue
  4. pop a fresh buffer from free_queue (head+1, buffer_ptrs[head % SLOT_COUNT])
  5. install as current, reset count, wmb

The underlying queue layouts are already identical — the ready header exposes queue_heads[] / queue_tails[] / queues[][] and the free-queue exposes head / tail / buffer_ptrs[], differing only in type name (PmuDataHeader / DepGenDataHeader / …). Because there is no shared engine, the copies have drifted: the backpressure poll-mask, trailing wmb(), top-of-loop rmb(), and null-slot handling all differed per copy until #1162 aligned them one file at a time — and TensorDump's switch_dump_meta_buffer still uses an inline DUMP_SPIN_WAIT_LIMIT spin instead of the wait_for_free_queue_entry helper the others adopted, a live example of the drift. Every future device-side change has to be made 5+ times and kept in sync by hand.
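As a rough illustration of the five-step skeleton over the shared layouts, here is a minimal host-compilable sketch. Every name in it (FreeQueueSketch, ReadyQueueSketch, CollectorStateSketch, switch_buffer_sketch, rmb_sketch, wmb_sketch) is hypothetical, the ready queue is reduced to a single queue instead of queue_heads[] / queue_tails[] / queues[][], ready-queue backpressure is left out, and the standard-library fences only stand in for the platform rmb() / wmb(). The barrier placement and the "add the current count to a dropped counter" accounting are assumptions, not the existing code.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the device barriers (assumption: real code uses rmb()/wmb()).
inline void rmb_sketch() { std::atomic_thread_fence(std::memory_order_acquire); }
inline void wmb_sketch() { std::atomic_thread_fence(std::memory_order_release); }

// Free queue: head / tail / buffer_ptrs[], as described in the issue.
template <std::size_t SlotCount>
struct FreeQueueSketch {
    std::uint32_t head = 0;
    std::uint32_t tail = 0;
    void* buffer_ptrs[SlotCount] = {};
};

// Ready queue, simplified to one queue for brevity.
template <std::size_t QueueSize>
struct ReadyQueueSketch {
    std::uint32_t head = 0;
    std::uint32_t tail = 0;
    void* entries[QueueSize] = {};
};

struct CollectorStateSketch {
    void* current = nullptr;     // buffer currently being filled
    std::uint32_t count = 0;     // records in the current buffer
    std::uint32_t dropped = 0;   // records lost because no free buffer was available
};

// Returns true if the buffer was rotated, false on a null guard or a drop.
template <std::size_t SlotCount, std::size_t QueueSize>
bool switch_buffer_sketch(CollectorStateSketch* state,
                          FreeQueueSketch<SlotCount>* free_q,
                          ReadyQueueSketch<QueueSize>* ready_q) {
    // 1. Null guards.
    if (!state || !free_q || !ready_q || !state->current) return false;
    rmb_sketch();
    // 2. No free buffer: count the drop, reset the current buffer, return.
    if (free_q->head == free_q->tail) {
        state->dropped += state->count;
        state->count = 0;
        return false;
    }
    // 3. Enqueue the full buffer; publish the slot before the tail.
    ready_q->entries[ready_q->tail % QueueSize] = state->current;
    wmb_sketch();
    ready_q->tail += 1;
    // 4. Pop a fresh buffer from the free queue.
    void* fresh = free_q->buffer_ptrs[free_q->head % SlotCount];
    free_q->head += 1;
    // 5. Install it as current, reset the count, write barrier.
    state->current = fresh;
    state->count = 0;
    wmb_sketch();
    return true;
}
```

Because only the type names differ between subsystems, a function of this shape could in principle be instantiated once per header type rather than copied per collector.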

Proposed structure

| Layer | Contents |
| --- | --- |
| Device collector engine (templated on a Module trait) | enqueue_ready(buffer_ptr, seq) (ready-queue push), pop_free() (free-queue pop-and-install), switch_buffer() (the enqueue-full + pop-fresh + install skeleton above), flush() (flush the partial current buffer at teardown), record(...) (append + switch-if-full hot path), and the init/finalize skeleton (set header/pool pointers, reset state) |
| Per-subsystem trait | header / free-queue / buffer types; buffer-kind count; *_READYQUEUE_SIZE / *_SLOT_COUNT / backpressure-cycle constants; the record-field-store hook; the drop-accounting hook; the instance shape (single-instance vs per-thread) |

Each *_aicpu.cpp then collapses to a trait plus a few subsystem-specific hooks, instead of a full re-implementation of enqueue/pop/switch/flush/record. This also erases the current per-copy drift (the TensorDump spin, the divergent logging, the differing drop-accounting) by construction.
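To make the engine/trait split concrete, the sketch below shows one possible shape. DemoModule, DeviceCollectorEngineSketch, and all member names are hypothetical and are not taken from the codebase or from ProfilerAlgorithms<Module>. Memory barriers, ready-queue backpressure, and the header/pool pointer setup are omitted for brevity; a real engine would have to preserve every existing rmb() / wmb().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-subsystem trait: types, constants, and the two hooks.
struct DemoModule {
    struct Record { std::uint64_t task_id; std::uint64_t cycles; };
    static constexpr std::size_t kRecordsPerBuffer = 2;
    static constexpr std::size_t kSlotCount = 4;        // free-queue slots
    static constexpr std::size_t kReadyQueueSize = 4;   // ready-queue slots
    struct Buffer { Record records[kRecordsPerBuffer]; std::uint32_t count = 0; };
    // Record-field-store hook.
    static void store(Record& dst, std::uint64_t task_id, std::uint64_t cycles) {
        dst.task_id = task_id;
        dst.cycles = cycles;
    }
    // Drop-accounting hook.
    static void on_drop(std::uint32_t& dropped, std::uint32_t lost) { dropped += lost; }
};

// Hypothetical engine: the shared operation layer, written once.
template <typename Module>
class DeviceCollectorEngineSketch {
public:
    using Buffer = typename Module::Buffer;

    // First buffer becomes current; the rest seed the free queue.
    void init(Buffer* pool, std::size_t n) {
        assert(n >= 1 && n <= Module::kSlotCount + 1);
        current_ = &pool[0];
        current_->count = 0;
        for (std::size_t i = 1; i < n; ++i) free_[free_tail_++ % Module::kSlotCount] = &pool[i];
    }

    // Hot path: append, then switch if the buffer is full.
    template <typename... Fields>
    void record(Fields... fields) {
        if (current_ == nullptr) return;
        Module::store(current_->records[current_->count], fields...);
        current_->count += 1;
        if (current_->count == Module::kRecordsPerBuffer) switch_buffer();
    }

    // Teardown: hand the partial buffer to the ready queue without popping a new one.
    void flush() {
        if (current_ == nullptr || current_->count == 0) return;
        enqueue_ready(current_);
        current_ = nullptr;
    }

    std::uint32_t dropped() const { return dropped_; }
    std::uint32_t ready_count() const { return ready_tail_; }
    Buffer* ready_at(std::uint32_t i) const { return ready_[i % Module::kReadyQueueSize]; }

private:
    void enqueue_ready(Buffer* b) { ready_[ready_tail_++ % Module::kReadyQueueSize] = b; }

    void switch_buffer() {
        if (free_head_ == free_tail_) {               // no free buffer: drop and reuse
            Module::on_drop(dropped_, current_->count);
            current_->count = 0;
            return;
        }
        enqueue_ready(current_);
        current_ = free_[free_head_++ % Module::kSlotCount];
        current_->count = 0;
    }

    Buffer* current_ = nullptr;
    Buffer* free_[Module::kSlotCount] = {};
    Buffer* ready_[Module::kReadyQueueSize] = {};
    std::uint32_t free_head_ = 0, free_tail_ = 0, ready_tail_ = 0, dropped_ = 0;
};
```

Under this shape a subsystem would supply only something like DemoModule, and a fix to the switch or flush logic would be made once in the engine.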

Caveats

  • L2Swimlane is the outlier. It has 4 buffer kinds, per-core task pools + per-thread phase pools + AICore rotation, plus flush_phase_pool / switch_phase_buffer_kind. The single-instance subsystems (DepGen, ScopeStats) and simple per-thread ones (PMU, TensorDump) collapse cleanly into the engine; L2's multi-pool structure has to compose several engine instances or keep some subsystem-specific code on top. Don't force it into the base if it distorts the common case.
  • Device memory-ordering code. Every rmb() / wmb() must be preserved exactly; sim does not exercise weak-memory reordering, so this must be onboard-validated on both arches.
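One way the "compose several engine instances" option for L2Swimlane might look is sketched below. MiniEngineSketch is a deliberately trivial stand-in that only counts records, and TaskKindSketch, PhaseKindSketch, and MultiPoolCollectorSketch are hypothetical names. The sketch illustrates only the composition shape (per-core task pools next to per-thread phase pools), not AICore rotation, flush_phase_pool, or any barrier handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Trivial stand-in for the shared engine: it only counts records.
template <typename Module>
struct MiniEngineSketch {
    std::uint32_t recorded = 0;
    void record(const typename Module::Record&) { ++recorded; }
};

// Hypothetical traits for two of the buffer kinds.
struct TaskKindSketch  { struct Record { std::uint64_t task_id; }; };
struct PhaseKindSketch { struct Record { std::uint32_t phase; }; };

// Subsystem-specific wrapper that owns several engine instances
// instead of pushing multi-pool logic into the shared engine.
template <std::size_t Cores, std::size_t Threads>
struct MultiPoolCollectorSketch {
    MiniEngineSketch<TaskKindSketch> task_pools[Cores];       // one per core
    MiniEngineSketch<PhaseKindSketch> phase_pools[Threads];   // one per thread

    void record_task(std::size_t core, std::uint64_t task_id) {
        assert(core < Cores);
        task_pools[core].record({task_id});
    }
    void record_phase(std::size_t thread, std::uint32_t phase) {
        assert(thread < Threads);
        phase_pools[thread].record({phase});
    }
};
```

The routing (which pool a record goes to) stays in the wrapper, so the common single-instance and per-thread cases would not pay for L2's extra structure.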

Location

(Symbols only — files are a moving target.)

  • Per-arch: src/{a2a3,a5}/platform/shared/aicpu/{l2_swimlane_collector,pmu_collector,dep_gen_collector}_aicpu.cpp. Symbols: enqueue_*_ready_buffer, switch_*_buffer / switch_records_buffer / switch_phase_buffer_kind, flush_phase_pool, try_pop_*.
  • Common: src/common/platform/shared/aicpu/{tensor_dump_aicpu,scope_stats_collector_aicpu}.cpp. Symbols: enqueue_*, switch_dump_meta_buffer (note the inline DUMP_SPIN_WAIT_LIMIT), switch_buffer.
  • Candidate home for the shared engine header: alongside src/common/platform/include/aicpu/.
  • Host precedent to mirror: src/common/platform/include/host/profiler_base.h (ProfilerAlgorithms<Module> + the Module trait contract).

Proposed Fix

Introduce one templated device-side collector engine (single source of truth) implementing enqueue/pop/switch/flush/record/init-finalize over a Module trait, included by all five AICPU collectors in place of the per-subsystem copies — the device analog of ProfilerAlgorithms<Module>. Mechanical, onboard-validated, preserving every barrier. Land after #1162 (which aligned the helper shapes and is the natural precursor).

Priority

Low (no impact today, good to fix eventually)
