diff --git a/docs/dfx/dep_gen.md b/docs/dfx/dep_gen.md
index 1da946890..7198ebd32 100644
--- a/docs/dfx/dep_gen.md
+++ b/docs/dfx/dep_gen.md
@@ -358,8 +358,8 @@ list; only the dep_gen replay graph loses the tail.
 | Layer | File | Role |
 | ----- | ---- | ---- |
 | Shared-mem layout | `src/{a2a3,a5}/platform/include/common/dep_gen.h` | `DepGenRecord` (4672 B base, cache-line aligned, ≤64 inline explicit_deps, per-task `block_num`) + `DepGenOverflowRecord` chain view (≤582 deps per slot) + SPSC ring + per-thread ready queue. Byte-identical layout across platforms. |
-| AICPU writer | `src/{a2a3,a5}/platform/{include,shared}/aicpu/dep_gen_collector_aicpu.{h,cpp}` | Single-instance write path; weak-fallback exported to host build. a5 reuses the a2a3 source verbatim — the writer accesses its own device-side view of shared memory, independent of how host↔device transport is implemented. |
-| Host collector | `src/{a2a3,a5}/platform/{include/host,shared/host}/dep_gen_collector.{h,cpp}` | `ProfilerBase<DepGenCollector, DepGenModule>` — drains ring → `records_` vector. On a5 (no SVM) it uses the base `alloc_paired_buffer`, which malloc's a host shadow + `copy_to_device`'s it and registers it via `add_malloc_shadow` so teardown can free it; `reconcile_counters` explicitly `copy_from_device`'s the BufferState before reading, and `finalize` lets `BufferPoolManager::clear_mappings()` release all shadows as the single source of truth. |
+| AICPU writer | `src/{a2a3,a5}/platform/include/aicpu/dep_gen_collector_aicpu.h`, `src/common/platform/shared/aicpu/dep_gen_collector_aicpu.cpp` | Single-instance write path; weak-fallback exported to host build. Both platforms share the same writer implementation — the writer accesses its own device-side view of shared memory, independent of how host↔device transport is implemented. |
+| Host collector | `src/common/platform/include/host/dep_gen_collector.h`, `src/common/platform/shared/host/dep_gen_collector.cpp` | `ProfilerBase<DepGenCollector, DepGenModule>`; drains ring → `records_` vector. On non-SVM platforms it uses the base `alloc_paired_buffer`, which allocates a host shadow with `malloc`, copies it to the device via `copy_to_device`, and registers it via `add_malloc_shadow` so teardown can free it; `reconcile_counters` explicitly calls `copy_from_device` on the BufferState before reading, and `finalize` lets `BufferPoolManager::clear_mappings()` release all shadows as the single source of truth. |
 | Capture call site | `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` `submit_task_common` | One conditional block that snapshots inputs into the ring when `is_dep_gen_enabled()`; fires for both `submit_task` and `submit_dummy_task`. The schema carries `kernel_ids[3] = {aic, aiv0, aiv1}` so the swimlane post-processor can resolve `task_id → kernel` from `deps.json` at level=1 where the AICore record is the sole device-side identity source. Inactive subslots stay at `INVALID_KERNEL_ID = -1`. It also carries the SPMD logical block num (`block_num` on a2a3, `core_num` on a5's launch spec) as `tasks[].block_num`. |
 | Replay | `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}` | Pure CPU; runs dual-pass differential replay — `compute_task_fanin` (oracle) + inlined STEP A/B mirror (annotated) against two `PTO2TensorMap` instances. Emits `deps.json` when both passes agree per record. Platform-agnostic — a5 reuses the a2a3 source verbatim. |
 | Device-runner hookup | `src/{a2a3,a5}/platform/{onboard,sim}/host/device_runner.cpp` | post-`reconcile_counters` calls `dep_gen_replay_emit_deps_json(records.data(), records.size(), deps_path)` |
diff --git a/docs/dfx/l2-swimlane-profiling.md b/docs/dfx/l2-swimlane-profiling.md
index d7a051854..241e8b984 100644
--- a/docs/dfx/l2-swimlane-profiling.md
+++ b/docs/dfx/l2-swimlane-profiling.md
@@ -708,7 +708,7 @@ export_swimlane_json()             ← writes <output_prefix>/l2_swimlane_record
 finalize(unregister, free)
 ```
 
-[`L2SwimlaneCollector`](../src/a2a3/platform/include/host/l2_swimlane_collector.h)
+[`L2SwimlaneCollector`](../src/common/platform/include/host/l2_swimlane_collector.h)
 on a2a3 inherits from
 [`profiling_common::ProfilerBase<L2SwimlaneCollector, L2SwimlaneModule>`](../src/common/platform/include/host/profiler_base.h):
 the base class owns split mgmt threads, collector shards, and the
@@ -838,7 +838,7 @@ l2_swimlane_collector_.export_swimlane_json()
 l2_swimlane_collector_.finalize()
 ```
 
-[`L2SwimlaneCollector`](../src/a5/platform/include/host/l2_swimlane_collector.h)
+[`L2SwimlaneCollector`](../src/common/platform/include/host/l2_swimlane_collector.h)
 on a5 inherits the same CRTP base
 ([`profiling_common::ProfilerBase`](../src/common/platform/include/host/profiler_base.h))
 as a2a3 and parameterizes
diff --git a/docs/hardware/cache-coherency.md b/docs/hardware/cache-coherency.md
index c0174dc40..ae5e8494c 100644
--- a/docs/hardware/cache-coherency.md
+++ b/docs/hardware/cache-coherency.md
@@ -99,7 +99,7 @@ Two separate concerns, often conflated:
   `rmb()` between the COND check and the slot reads.
 
 Concretely, the L2 swimlane staging-slot read in
-`src/{a2a3,a5}/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp` does
+`src/common/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp` does
 **not** call `cache_invalidate_range` on the slot, but it **does** call
 `rmb()` before reading `slot->task_id` and the timing fields. All of
 those fields are AICore writes covered by the AICore-side `dcci` in
diff --git a/docs/profiling-framework.md b/docs/profiling-framework.md
index b5920b25d..f5b30b916 100644
--- a/docs/profiling-framework.md
+++ b/docs/profiling-framework.md
@@ -177,8 +177,8 @@ the required members are:
 
 The Module structs are defined alongside their collectors in
 [pmu_collector.h](../src/a2a3/platform/include/host/pmu_collector.h),
-[l2_swimlane_collector.h](../src/a2a3/platform/include/host/l2_swimlane_collector.h),
-[dep_gen_collector.h](../src/a2a3/platform/include/host/dep_gen_collector.h),
+[l2_swimlane_collector.h](../src/common/platform/include/host/l2_swimlane_collector.h),
+[dep_gen_collector.h](../src/common/platform/include/host/dep_gen_collector.h),
 [tensor_dump_collector.h](../src/common/platform/include/host/tensor_dump_collector.h),
 and
 [scope_stats_collector.h](../src/common/platform/include/host/scope_stats_collector.h)
@@ -336,13 +336,13 @@ Existing collectors are the canonical examples:
 
 - [`PmuCollector`](../src/a2a3/platform/include/host/pmu_collector.h)
   — single kind, per-core instances. See [pmu-profiling.md](dfx/pmu-profiling.md).
-- [`DepGenCollector`](../src/a2a3/platform/include/host/dep_gen_collector.h)
+- [`DepGenCollector`](../src/common/platform/include/host/dep_gen_collector.h)
   — single kind, one instance. See [dep_gen.md](dfx/dep_gen.md).
 - [`TensorDumpCollector`](../src/common/platform/include/host/tensor_dump_collector.h)
   — single kind, per-AICPU-thread instances. See [args-dump.md](dfx/args-dump.md).
 - [`ScopeStatsCollector`](../src/common/platform/include/host/scope_stats_collector.h)
   — single kind, one instance. See [scope-stats.md](dfx/scope-stats.md).
-- [`L2SwimlaneCollector`](../src/a2a3/platform/include/host/l2_swimlane_collector.h)
+- [`L2SwimlaneCollector`](../src/common/platform/include/host/l2_swimlane_collector.h)
   — four kinds (AICPU task, scheduler phase, orchestrator phase, AICore
   task), per-core / per-thread instances; the canonical multi-kind example. See
   [l2-swimlane-profiling.md](dfx/l2-swimlane-profiling.md).
diff --git a/src/a2a3/platform/include/host/dep_gen_collector.h b/src/a2a3/platform/include/host/dep_gen_collector.h
deleted file mode 100644
index e5f86a89d..000000000
--- a/src/a2a3/platform/include/host/dep_gen_collector.h
+++ /dev/null
@@ -1,294 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-
-/**
- * @file dep_gen_collector.h
- * @brief Host-side dep_gen (SubmitTrace) buffer allocation, streaming
- *        collection, and raw binary export.
- *
- * Architecture:
- * - BufferPoolManager<DepGenModule>: shared mgmt-thread infrastructure that
- *   polls per-thread ready queues, drains done-queue shards, and replenishes
- *   the single instance's free_queue from a unified recycled pool.
- * - DepGenCollector: collector thread shards pop full DepGenBuffers from the
- *   manager and append their DepGenRecords to a binary file
- *   (submit_trace.bin).
- *
- * Lifecycle:
- *   init()                       — Allocate header + 1 BufferState + N DepGenBuffers
- *                                  (pre-fills free_queue; surplus → recycled pool).
- *                                  Calls set_memory_context() on the base.
- *   start(tf)                    — Inherited: launches mgmt + collector threads.
- *   [device execution]
- *   stop()                       — Inherited: drain queues, join threads.
- *   reconcile_counters()         — Sanity-check current_buf_ptr is cleared by
- *                                  AICPU flush, run collected+dropped==total
- *                                  cross-check. If dropped_record_count > 0,
- *                                  the host caller skips deps.json emission
- *                                  (incomplete graph; user gets a warning).
- *   finalize()                   — Free all device memory, unregister.
- *
- * Output format (submit_trace.bin): a fixed-size header followed by a
- * contiguous stream of DepGenRecord values. Replay (future PR) reads this
- * back. Layout intentionally trivial (no varint / framing) so the
- * `sizeof(DepGenRecord)` ABI in `common/dep_gen.h` is the only contract.
- */
-
-#ifndef SRC_A2A3_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_
-#define SRC_A2A3_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_
-
-#include <atomic>
-#include <cstddef>
-#include <cstdint>
-#include <filesystem>
-#include <mutex>
-#include <optional>
-#include <string>
-#include <system_error>
-#include <vector>
-
-#include "common/dep_gen.h"
-#include "common/platform_config.h"
-#include "common/unified_log.h"
-#include "host/profiler_base.h"
-
-// ---------------------------------------------------------------------------
-// dep_gen Module (drives BufferPoolManager<DepGenModule>)
-// ---------------------------------------------------------------------------
-
-/**
- * Internal hand-off struct delivered from a drain thread to a collector shard.
- * thread_index identifies the AICPU thread queue the entry was popped from
- * (always equal to the orchestrator thread index, since dep_gen is single-
- * instance — exposed for symmetry with PmuReadyBufferInfo).
- */
-struct DepGenReadyBufferInfo {
-    uint32_t instance_index;  // Always 0 (single instance)
-    uint32_t thread_index;    // AICPU thread queue index this entry came from
-    void *dev_buffer_ptr;
-    void *host_buffer_ptr;
-    uint32_t buffer_seq;
-};
-
-struct DepGenModule {
-    using DataHeader = DepGenDataHeader;
-    using ReadyEntry = DepGenReadyQueueEntry;
-    using ReadyBufferInfo = ::DepGenReadyBufferInfo;
-    using FreeQueue = DepGenFreeQueue;
-
-    static constexpr int kBufferKinds = 1;
-    static constexpr uint32_t kReadyQueueSize = PLATFORM_DEP_GEN_READYQUEUE_SIZE;
-    static constexpr uint32_t kSlotCount = PLATFORM_DEP_GEN_SLOT_COUNT;
-    static constexpr const char *kSubsystemName = "DepGenModule";
-    static constexpr int kMgmtDrainThreadCount = PLATFORM_MAX_AICPU_THREADS;
-    static constexpr int kCollectorThreadCount = PLATFORM_MAX_AICPU_THREADS;
-
-    /**
-     * Buffers grown by proactive_replenish are batch-allocated up to the
-     * per-instance ceiling minus the slot count.
-     */
-    static constexpr int batch_size(int /*kind*/) {
-        constexpr int kBatch = PLATFORM_DEP_GEN_BUFFERS_PER_INSTANCE - PLATFORM_DEP_GEN_SLOT_COUNT;
-        return kBatch < 1 ? 1 : kBatch;
-    }
-
-    static DataHeader *header_from_shm(void *shm) { return get_dep_gen_header(shm); }
-
-    /**
-     * `count` is intentionally NOT reset here — AICPU is the sole writer and
-     * resets it itself on flush/drop/pop.
-     */
-    static std::optional<profiling_common::EntrySite<DepGenModule>>
-    resolve_entry(void *shm, DataHeader *header, int q, const ReadyEntry &entry) {
-        if (shm == nullptr || header == nullptr) {
-            LOG_ERROR("DepGenModule: invalid shared memory/header while resolving ready entry");
-            return std::nullopt;
-        }
-        if (header->num_instances != 1 || entry.instance_index >= header->num_instances) {
-            LOG_ERROR(
-                "DepGenModule: invalid ready entry instance=%u (num_instances=%u)", entry.instance_index,
-                header->num_instances
-            );
-            return std::nullopt;
-        }
-        DepGenBufferState *state = get_dep_gen_buffer_state(shm, static_cast<int>(entry.instance_index));
-        profiling_common::EntrySite<DepGenModule> site;
-        site.kind = 0;
-        site.free_queue = &state->free_queue;
-        site.buffer_size = sizeof(DepGenBuffer);
-        site.info.instance_index = entry.instance_index;
-        site.info.thread_index = static_cast<uint32_t>(q);
-        site.info.dev_buffer_ptr = reinterpret_cast<void *>(entry.buffer_ptr);
-        site.info.host_buffer_ptr = nullptr;  // filled by ProfilerAlgorithms
-        site.info.buffer_seq = entry.buffer_seq;
-        return site;
-    }
-
-    template <typename Cb>
-    static void for_each_instance(void *shm, DataHeader *header, Cb &&cb) {
-        const int n = static_cast<int>(header->num_instances);
-        for (int i = 0; i < n; i++) {
-            DepGenBufferState *state = get_dep_gen_buffer_state(shm, i);
-            cb(/*kind=*/0, &state->free_queue, sizeof(DepGenBuffer));
-        }
-    }
-};
-
-// ---------------------------------------------------------------------------
-// Memory callbacks — thin aliases for the canonical profiling_common shapes.
-// alloc / free are std::function so callers bind their MemoryAllocator via
-// lambda capture; register / unregister stay as plain function pointers
-// because they wrap stateless HAL globals (halHost*).
-// ---------------------------------------------------------------------------
-
-using DepGenAllocCallback = profiling_common::ProfAllocCallback;
-using DepGenRegisterCallback = profiling_common::ProfRegisterCallback;
-using DepGenUnregisterCallback = profiling_common::ProfUnregisterCallback;
-using DepGenFreeCallback = profiling_common::ProfFreeCallback;
-
-// ---------------------------------------------------------------------------
-// DepGenCollector
-// ---------------------------------------------------------------------------
-
-class DepGenCollector : public profiling_common::ProfilerBase<DepGenCollector, DepGenModule> {
-public:
-    DepGenCollector() = default;
-    ~DepGenCollector();
-
-    DepGenCollector(const DepGenCollector &) = delete;
-    DepGenCollector &operator=(const DepGenCollector &) = delete;
-
-    static constexpr int kIdleTimeoutSec = PLATFORM_DEP_GEN_TIMEOUT_SECONDS;
-    static constexpr const char *kSubsystemName = "DepGen";
-
-    /**
-     * Allocate dep_gen shared memory and pre-populate the free_queue.
-     *
-     * Allocates a DepGenDataHeader + 1 DepGenBufferState, plus
-     * PLATFORM_DEP_GEN_BUFFERS_PER_INSTANCE DepGenBuffers. The first
-     * PLATFORM_DEP_GEN_SLOT_COUNT buffers go directly into the free_queue;
-     * the surplus go into BufferPoolManager's shared recycled pool.
-     *
-     * @param num_threads     Number of AICPU scheduling threads (so the
-     *                        DataHeader sizes its per-thread ready queues)
-     * @param submit_trace_path  Output file path (.bin)
-     * @param alloc_cb        Memory allocation callback
-     * @param register_cb     halHostRegister callback (nullptr in sim)
-     * @param free_cb         Memory free callback
-     * @param device_id       Device ID
-     * @return 0 on success, non-zero on failure
-     */
-    int init(
-        int num_threads, const DepGenAllocCallback &alloc_cb, DepGenRegisterCallback register_cb,
-        const DepGenFreeCallback &free_cb, int device_id
-    );
-
-    /**
-     * Device pointer to the DepGenDataHeader. Set kernel_args.dep_gen_data_base
-     * to this after init() so AICPU can find the shared memory via
-     * set_platform_dep_gen_base().
-     */
-    void *get_dep_gen_shm_device_ptr() const { return shm_dev_; }
-
-    /**
-     * Per-buffer callback invoked by ProfilerBase's poll loop. Appends the
-     * buffer's DepGenRecord entries to the in-memory ``records_`` vector
-     * (no disk I/O — the host replay consumes that vector directly via
-     * ``records()`` once the device run completes).
-     */
-    void on_buffer_collected(const DepGenReadyBufferInfo &info);
-
-    /**
-     * After stop(): cross-check collected + dropped == total. If dropped > 0,
-     * the host caller skips deps.json emission so users get an incomplete-
-     * graph warning rather than partial data they might mistake for complete.
-     *
-     * @return true iff the run captured a complete trace (no drops, no leftovers).
-     */
-    bool reconcile_counters();
-
-    /**
-     * Free all device memory and release the in-memory record buffer. Idempotent.
-     */
-    void finalize(DepGenUnregisterCallback unregister_cb, const DepGenFreeCallback &free_cb);
-
-    /**
-     * @return true if init() succeeded and finalize() has not run.
-     */
-    bool is_initialized() const { return initialized_; }
-
-    /**
-     * Total DepGenRecords drained from the device-side ring buffer so far.
-     */
-    uint64_t total_collected() const { return total_collected_; }
-
-    /**
-     * In-memory record buffer (host replay's input). Valid between init()
-     * and finalize(); pointer/size stay stable after stop() returns, which
-     * is when the caller hands them to ``dep_gen_replay_emit_deps_json``.
-     */
-    const std::vector<DepGenRecord> &records() const { return records_; }
-
-private:
-    bool initialized_ = false;
-    int num_threads_ = 0;
-
-    // Shared memory region (DepGenDataHeader + DepGenBufferState[1]).
-    // shm_host_ / device_id_ live on ProfilerBase (set via set_memory_context
-    // in init()).
-    void *shm_dev_ = nullptr;
-    bool shm_registered_ = false;
-    size_t shm_size_ = 0;
-
-    bool buffers_registered_ = false;
-
-    // In-memory record buffer — drained from the device ring on
-    // on_buffer_collected() and consumed by the host replay directly (no
-    // disk hop). Mutex serializes the mgmt thread's appends against the
-    // (rare) reader on the same collector instance.
-    std::vector<DepGenRecord> records_;
-    std::mutex records_mutex_;
-
-    // Running total of records appended. Equal to ``records_.size()`` after
-    // every append; kept separately for the reconcile_counters cross-check
-    // even when records_ may be inspected concurrently.
-    uint64_t total_collected_ = 0;
-
-    DepGenDataHeader *dep_gen_header() const { return get_dep_gen_header(shm_host_); }
-    DepGenBufferState *dep_gen_state(int idx = 0) const { return get_dep_gen_buffer_state(shm_host_, idx); }
-
-    void append_buffer_records(const void *buf_host_ptr);
-};
-
-/**
- * Build the ``deps.json`` output path under the caller-provided per-task
- * directory. Filename is fixed (no timestamp) — the directory is the
- * per-task uniqueness boundary, mirroring make_pmu_csv_path() and the now-
- * removed make_dep_gen_path() for submit_trace.bin (deps.json is the only
- * on-disk dep_gen artifact since the in-memory capture refactor).
- */
-inline std::string make_deps_json_path(const std::string &output_dir) {
-    // Use std::filesystem::path's operator/ for join — robust against trailing
-    // slashes or path quirks that bare string concat would silently pass
-    // through. The sibling make_pmu_csv_path / make_l2_swimlane_path still use
-    // string concat; converting those is a follow-up cleanup since the
-    // project's output_prefix paths come from scene_test.py's pathlib join
-    // (never trailing-slashed in practice).
-    std::filesystem::path dir(output_dir);
-    std::error_code ec;
-    std::filesystem::create_directories(dir, ec);
-    if (ec) {
-        LOG_WARN("Failed to create dep_gen output directory %s: %s", output_dir.c_str(), ec.message().c_str());
-    }
-    return (dir / "deps.json").string();
-}
-
-#endif  // SRC_A2A3_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_
diff --git a/src/a2a3/platform/include/host/l2_swimlane_collector.h b/src/a2a3/platform/include/host/l2_swimlane_collector.h
deleted file mode 100644
index b8bd2bb9b..000000000
--- a/src/a2a3/platform/include/host/l2_swimlane_collector.h
+++ /dev/null
@@ -1,499 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-
-/**
- * @file l2_swimlane_collector.h
- * @brief Platform-agnostic performance data collector with dynamic memory management.
- *
- * Architecture:
- * - BufferPoolManager<L2SwimlaneModule>: shared mgmt-thread infrastructure that polls
- *   the AICPU ready queue, replenishes per-core / per-thread free queues, and
- *   hands full buffers off to collector thread shards.
- * - L2SwimlaneCollector: collector thread shards copy records from manager ready queues
- *   into host vectors; the owner thread exports the swimlane visualization after stop().
- *
- * Memory operations are injected through callbacks for sim/onboard portability.
- */
-
-#ifndef SRC_A2A3_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_
-#define SRC_A2A3_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_
-
-#include <atomic>
-#include <array>
-#include <cstddef>
-#include <cstdint>
-#include <functional>
-#include <mutex>
-#include <string>
-#include <thread>
-#include <vector>
-
-#include "common/l2_swimlane_profiling.h"
-#include "common/memory_barrier.h"
-#include "common/platform_config.h"
-#include "common/unified_log.h"
-#include "host/profiler_base.h"
-
-// ---------------------------------------------------------------------------
-// L2 Perf profiling Module (drives BufferPoolManager<L2SwimlaneModule>)
-// ---------------------------------------------------------------------------
-
-/**
- * L2 Perf has four distinct buffer kinds going through one ready queue per
- * AICPU thread:
- *   - kind 0: per-core    L2SwimlaneAicpuTaskBuffer      (task records)
- *   - kind 1: per-thread  L2SwimlaneAicpuSchedPhaseBuffer (scheduler phase records)
- *   - kind 2: per-thread  L2SwimlaneAicpuOrchPhaseBuffer  (orchestrator phase records)
- *   - kind 3: per-core    L2SwimlaneAicoreTaskBuffer     (AICore-written records)
- * The ReadyQueueEntry::kind flag picks among them.
- */
-
-/**
- * Buffer kind discriminator carried in ReadyBufferInfo and used to index the
- * per-kind recycled pool inside BufferPoolManager. Values match
- * L2SwimlaneBufferKind 1:1.
- */
-enum class ProfBufferType {
-    AICPU_TASK = 0,
-    AICPU_SCHED_PHASE = 1,
-    AICPU_ORCH_PHASE = 2,
-    AICORE_TASK = 3,
-};
-
-/**
- * Information about a ready (full) buffer, passed from mgmt thread to main thread.
- */
-struct ReadyBufferInfo {
-    ProfBufferType type;
-    uint32_t index;         // core_index (task) or thread_idx (phase)
-    uint32_t slot_idx;      // Reserved (unused in free queue design)
-    void *dev_buffer_ptr;   // Device address of the full buffer
-    void *host_buffer_ptr;  // Host-mapped address (sim: same as dev)
-    uint32_t buffer_seq;    // Sequence number for ordering
-};
-
-struct L2SwimlaneModule {
-    using DataHeader = L2SwimlaneDataHeader;
-    using ReadyEntry = ReadyQueueEntry;
-    using ReadyBufferInfo = ::ReadyBufferInfo;
-    using FreeQueue = L2SwimlaneFreeQueue;  // all pool types share the same free_queue layout
-
-    static constexpr int kBufferKinds = 4;
-    static constexpr uint32_t kReadyQueueSize = PLATFORM_PROF_READYQUEUE_SIZE;
-    static constexpr uint32_t kSlotCount = PLATFORM_PROF_SLOT_COUNT;
-    static constexpr const char *kSubsystemName = "L2SwimlaneModule";
-    static constexpr int kMgmtDrainThreadCount = PLATFORM_MAX_AICPU_THREADS;
-    static constexpr int kCollectorThreadCount = PLATFORM_MAX_AICPU_THREADS;
-
-    /**
-     * batch_size for proactive_replenish's alloc fallback. Sized so that a
-     * fully empty recycled pool refills to the configured per-instance
-     * ceiling in one tick. Sched and orch phase pools are sized independently
-     * (PLATFORM_PROF_{SCHED,ORCH}_BUFFERS_PER_THREAD).
-     */
-    static constexpr int batch_size(int kind) {
-        constexpr int kPerfBatch = PLATFORM_PROF_BUFFERS_PER_CORE - PLATFORM_PROF_SLOT_COUNT;
-        constexpr int kSchedBatch = PLATFORM_PROF_SCHED_BUFFERS_PER_THREAD - PLATFORM_PROF_SLOT_COUNT;
-        constexpr int kOrchBatch = PLATFORM_PROF_ORCH_BUFFERS_PER_THREAD - PLATFORM_PROF_SLOT_COUNT;
-        constexpr int kAicoreBatch = PLATFORM_AICORE_BUFFERS_PER_CORE - PLATFORM_PROF_SLOT_COUNT;
-        int b = kPerfBatch;
-        switch (static_cast<L2SwimlaneBufferKind>(kind)) {
-        case L2SwimlaneBufferKind::AicpuTask:
-            b = kPerfBatch;
-            break;
-        case L2SwimlaneBufferKind::AicpuSchedPhase:
-            b = kSchedBatch;
-            break;
-        case L2SwimlaneBufferKind::AicpuOrchPhase:
-            b = kOrchBatch;
-            break;
-        case L2SwimlaneBufferKind::AicoreTask:
-            b = kAicoreBatch;
-            break;
-        }
-        return b < 1 ? 1 : b;
-    }
-
-    static int kind_of(const ReadyBufferInfo &info) { return static_cast<int>(info.type); }
-
-    static DataHeader *header_from_shm(void *shm) { return get_l2_swimlane_header(shm); }
-
-    template <typename Mgr>
-    static void refresh_replenish_metadata(Mgr &mgr, DataHeader *header) {
-        mgr.read_range_from_device(&header->num_sched_phase_threads, sizeof(header->num_sched_phase_threads));
-        mgr.read_range_from_device(&header->num_orch_phase_threads, sizeof(header->num_orch_phase_threads));
-        rmb();
-    }
-
-    /**
-     * Branch on entry.kind to pick the per-core task state, per-thread sched-
-     * or orch-phase state, or per-core AICore state. Returns nullopt for
-     * out-of-range kind or core_index.
-     */
-    static std::optional<profiling_common::EntrySite<L2SwimlaneModule>>
-    resolve_entry(void *shm, DataHeader *header, int /*q*/, const ReadyEntry &entry) {
-        const int num_cores = static_cast<int>(header->num_cores);
-        const L2SwimlaneBufferKind kind = entry.kind;
-
-        // Validate kind first — out-of-range silently falling into the wrong
-        // branch reads a wrong-typed pool.
-        if (kind != L2SwimlaneBufferKind::AicpuTask && kind != L2SwimlaneBufferKind::AicpuSchedPhase &&
-            kind != L2SwimlaneBufferKind::AicpuOrchPhase && kind != L2SwimlaneBufferKind::AicoreTask) {
-            LOG_ERROR("L2SwimlaneModule: invalid entry kind=%u", static_cast<uint32_t>(kind));
-            return std::nullopt;
-        }
-
-        // Sched/orch phase entries are indexed by thread_idx; task/aicore by core_index.
-        const bool is_phase =
-            (kind == L2SwimlaneBufferKind::AicpuSchedPhase) || (kind == L2SwimlaneBufferKind::AicpuOrchPhase);
-        if (is_phase) {
-            if (entry.core_index >= static_cast<uint32_t>(PLATFORM_MAX_AICPU_THREADS)) {
-                LOG_ERROR("L2SwimlaneModule: invalid phase entry: thread=%u", entry.core_index);
-                return std::nullopt;
-            }
-        } else {
-            if (entry.core_index >= static_cast<uint32_t>(num_cores)) {
-                LOG_ERROR(
-                    "L2SwimlaneModule: invalid task entry: core=%u kind=%u", entry.core_index,
-                    static_cast<uint32_t>(kind)
-                );
-                return std::nullopt;
-            }
-        }
-
-        profiling_common::EntrySite<L2SwimlaneModule> site;
-        site.kind = static_cast<int>(kind);
-        site.info.index = entry.core_index;
-        site.info.slot_idx = 0;
-        site.info.dev_buffer_ptr = reinterpret_cast<void *>(entry.buffer_ptr);
-        site.info.host_buffer_ptr = nullptr;  // filled by ProfilerAlgorithms
-        site.info.buffer_seq = entry.buffer_seq;
-
-        switch (kind) {
-        case L2SwimlaneBufferKind::AicpuTask: {
-            auto *state = get_perf_buffer_state(shm, static_cast<int>(entry.core_index));
-            site.free_queue = &state->free_queue;
-            site.buffer_size = sizeof(L2SwimlaneAicpuTaskBuffer);
-            site.info.type = ProfBufferType::AICPU_TASK;
-            break;
-        }
-        case L2SwimlaneBufferKind::AicpuSchedPhase: {
-            auto *state = get_sched_phase_buffer_state(shm, num_cores, static_cast<int>(entry.core_index));
-            site.free_queue = &state->free_queue;
-            site.buffer_size = sizeof(L2SwimlaneAicpuSchedPhaseBuffer);
-            site.info.type = ProfBufferType::AICPU_SCHED_PHASE;
-            break;
-        }
-        case L2SwimlaneBufferKind::AicpuOrchPhase: {
-            auto *state = get_orch_phase_buffer_state(shm, num_cores, static_cast<int>(entry.core_index));
-            site.free_queue = &state->free_queue;
-            site.buffer_size = sizeof(L2SwimlaneAicpuOrchPhaseBuffer);
-            site.info.type = ProfBufferType::AICPU_ORCH_PHASE;
-            break;
-        }
-        case L2SwimlaneBufferKind::AicoreTask: {
-            auto *ac_state = get_aicore_buffer_state(shm, num_cores, static_cast<int>(entry.core_index));
-            site.free_queue = &ac_state->free_queue;
-            site.buffer_size = sizeof(L2SwimlaneAicoreTaskBuffer);
-            site.info.type = ProfBufferType::AICORE_TASK;
-            break;
-        }
-        }
-        return site;
-    }
-
-    template <typename Cb>
-    static void for_each_instance(void *shm, DataHeader *header, Cb &&cb) {
-        const int num_cores = static_cast<int>(header->num_cores);
-
-        // AicpuTask: per-core (kind 0)
-        for (int i = 0; i < num_cores; i++) {
-            auto *state = get_perf_buffer_state(shm, i);
-            cb(/*kind=*/static_cast<int>(L2SwimlaneBufferKind::AicpuTask), &state->free_queue,
-               sizeof(L2SwimlaneAicpuTaskBuffer));
-        }
-
-        // AicoreTask: per-core (kind 3)
-        for (int i = 0; i < num_cores; i++) {
-            auto *ac_state = get_aicore_buffer_state(shm, num_cores, i);
-            cb(/*kind=*/static_cast<int>(L2SwimlaneBufferKind::AicoreTask), &ac_state->free_queue,
-               sizeof(L2SwimlaneAicoreTaskBuffer));
-        }
-
-        // AicpuSchedPhase: per-thread (kind 1) — gated on the header's
-        // sched-phase thread count (zero when phase init never ran).
-        // Bounds-clamp against PLATFORM_MAX_AICPU_THREADS so a corrupted
-        // device-shared value can't walk off the pool array.
-        int num_sched_phase_threads = static_cast<int>(header->num_sched_phase_threads);
-        if (num_sched_phase_threads > PLATFORM_MAX_AICPU_THREADS) {
-            num_sched_phase_threads = 0;
-        }
-        for (int t = 0; t < num_sched_phase_threads; t++) {
-            auto *state = get_sched_phase_buffer_state(shm, num_cores, t);
-            cb(/*kind=*/static_cast<int>(L2SwimlaneBufferKind::AicpuSchedPhase), &state->free_queue,
-               sizeof(L2SwimlaneAicpuSchedPhaseBuffer));
-        }
-
-        // AicpuOrchPhase: per-thread (kind 2) — same bounds clamp.
-        int num_orch_phase_threads = static_cast<int>(header->num_orch_phase_threads);
-        if (num_orch_phase_threads > PLATFORM_MAX_AICPU_THREADS) {
-            num_orch_phase_threads = 0;
-        }
-        for (int t = 0; t < num_orch_phase_threads; t++) {
-            auto *state = get_orch_phase_buffer_state(shm, num_cores, t);
-            cb(/*kind=*/static_cast<int>(L2SwimlaneBufferKind::AicpuOrchPhase), &state->free_queue,
-               sizeof(L2SwimlaneAicpuOrchPhaseBuffer));
-        }
-    }
-};
-
-// Memory callbacks — thin aliases for the canonical profiling_common shapes.
-// alloc / free are std::function so callers bind their MemoryAllocator via
-// lambda capture; register / unregister stay as plain function pointers
-// because they wrap stateless HAL globals (halHost*).
-using L2SwimlaneAllocCallback = profiling_common::ProfAllocCallback;
-using L2SwimlaneRegisterCallback = profiling_common::ProfRegisterCallback;
-using L2SwimlaneUnregisterCallback = profiling_common::ProfUnregisterCallback;
-using L2SwimlaneFreeCallback = profiling_common::ProfFreeCallback;
-
-// =============================================================================
-// L2SwimlaneCollector
-// =============================================================================
-
-/**
- * Performance data collector.
- *
- * Lifecycle:
- *   1. initialize()                — allocate shared memory, pre-fill free_queues,
- *                                    hand the memory context to the base via
- *                                    set_memory_context().
- *   2. start(tf)                   — inherited from ProfilerBase; launches
- *                                    drain/refill, replenish, and collector threads.
- *   3. ... device execution ...
- *   4. stop()                      — joins drain/refill and replenish before
- *                                    letting collector threads exit.
- *   5. read_phase_header_metadata() — single-shot read of the core→thread
- *                                    mapping from L2SwimlaneDataHeader.
- *   6. reconcile_counters()        — device-side three-bucket accounting for
- *                                    both PERF and PHASE pools (total /
- *                                    collected / dropped).
- *   7. export_swimlane_json() / finalize().
- *
- * Host never reads from device-side `current_buf_ptr` to recover records:
- * device flush is the only data path. Any non-zero `current_buf_ptr` after
- * stop() is logged as a bug.
- */
-class L2SwimlaneCollector : public profiling_common::ProfilerBase<L2SwimlaneCollector, L2SwimlaneModule> {
-public:
-    L2SwimlaneCollector() = default;
-    ~L2SwimlaneCollector();
-
-    L2SwimlaneCollector(const L2SwimlaneCollector &) = delete;
-    L2SwimlaneCollector &operator=(const L2SwimlaneCollector &) = delete;
-
-    // ProfilerBase contract
-    static constexpr int kIdleTimeoutSec = PLATFORM_PROF_TIMEOUT_SECONDS;
-    static constexpr const char *kSubsystemName = "L2Swimlane";
-
-    /**
-     * Initialize performance profiling.
-     *
-     * Allocates the shared-memory region (header + per-core / per-thread
-     * BufferStates), pre-allocates initial L2SwimlaneAicpuTaskBuffers and PhaseBuffers,
-     * and seeds the per-pool free_queues + the framework's recycled pools.
-     *
-     * @param num_aicore               Number of AICore instances
-     * @param device_id                Device ID (forwarded to register_cb)
-     * @param l2_swimlane_level        Collection granularity (DISABLED / AICORE_TIMING
-     *                                 / AICPU_TIMING / SCHED_PHASES / ORCH_PHASES).
-     *                                 Written into
-     *                                 `L2SwimlaneDataHeader::l2_swimlane_level`
-     *                                 so AICPU can promote it in
-     *                                 `l2_swimlane_aicpu_init`, AND cached on the
-     *                                 collector so `export_swimlane_json()`
-     *                                 can gate phase sections and stamp the
-     *                                 JSON `version`.
-     * @param alloc_cb                 Device memory allocation callback
-     * @param register_cb              Memory registration callback (nullptr for
-     *                                 simulation)
-     * @param free_cb                  Device memory free callback
-     * @param user_data                Opaque pointer forwarded to callbacks
-     * @param output_prefix            Per-task directory; l2_swimlane_records.json
-     *                                 lands here. Required (non-empty);
-     *                                 CallConfig::validate() enforces this
-     *                                 upstream.
-     * @return 0 on success, error code on failure
-     */
-    int initialize(
-        int num_aicore, int aicpu_thread_num, int device_id, L2SwimlaneLevel l2_swimlane_level,
-        const L2SwimlaneAllocCallback &alloc_cb, L2SwimlaneRegisterCallback register_cb,
-        const L2SwimlaneFreeCallback &free_cb, const std::string &output_prefix
-    );
-
-    /**
-     * Per-buffer callback invoked by ProfilerBase's collector loop. Dispatches on
-     * info.type to copy either an L2SwimlaneAicpuTaskBuffer (PERF_RECORD) into the per-core
-     * record vector, or an L2SwimlaneAicpuSchedPhaseBuffer / L2SwimlaneAicpuOrchPhaseBuffer into the per-thread
-     * phase-record vector.
-     */
-    void on_buffer_collected(const ReadyBufferInfo &info);
-
-    /**
-     * Publish per-core core_type (AIC/AIV/...) so the host emit path can
-     * resolve the lane label without consulting an AICPU task record. Required
-     * for AICORE_TIMING (level=1) where complete_task is bypassed and the
-     * AICore record alone is on disk. Caller is the device_runner — sim sets
-     * it from `runtime.workers[i].core_type` (rule-based), onboard sets it
-     * from the handshake-discovered table.
-     *
-     * Safe to call multiple times; the last call wins.
-     *
-     * @param types  CoreType[n] table indexed by core_id
-     * @param n      table length (typically `num_aicore`)
-     */
-    void set_core_types(const CoreType *types, int n);
-
-    /**
-     * Export collected records as a Chrome Trace Event JSON (swimlane view).
-     * Writes <output_prefix>/l2_swimlane_records.json — directory is captured at
-     * initialize() time.
-     *
-     * @return 0 on success, error code on failure
-     */
-    int export_swimlane_json();
-
-    /**
-     * Free all device memory and unregister mappings. Idempotent on a
-     * collector that was never initialized.
-     *
-     * @param unregister_cb  Memory unregister callback (nullptr in sim mode)
-     * @param free_cb        Memory free callback
-     * @param user_data      Opaque pointer forwarded to callbacks
-     * @return 0 on success, error code on failure
-     */
-    int finalize(L2SwimlaneUnregisterCallback unregister_cb, const L2SwimlaneFreeCallback &free_cb);
-
-    /**
-     * @return true if initialize() succeeded and finalize() has not run.
-     */
-    bool is_initialized() const { return shm_host_ != nullptr; }
-
-    /**
-     * Device pointer to the L2SwimlaneDataHeader. Set kernel_args.l2_swimlane_data_base
-     * to this after initialize() succeeds so the AICPU side can find the
-     * shared memory.
-     */
-    void *get_l2_swimlane_setup_device_ptr() const { return perf_shared_mem_dev_; }
-
-    /**
-     * Device pointer to a uint64_t[num_aicore] table where each entry will
-     * hold this core's `&L2SwimlaneAicoreTaskPool::rotation` device address. Host
-     * only allocates the bytes here; AICPU populates the entries inside
-     * `l2_swimlane_aicpu_init`. Freed by finalize(). Set kernel_args.l2_swimlane_aicore_rotation_table
-     * to this so the AICore kernel entry can index by block_idx and feed the
-     * per-core rotation channel into `set_l2_swimlane_aicore_head_slot()`. Returns
-     * nullptr before initialize() succeeds.
-     */
-    void *get_aicore_ring_addr_table_device_ptr() const { return aicore_ring_addr_table_dev_; }
-
-    /**
-     * Read AICPU phase metadata that lives in L2SwimlaneDataHeader (not on the
-     * buffer pipeline): the core→thread mapping plus a has-data signal
-     * derived from accumulated per-event records. Single-shot — must be
-     * called after stop() so the shm region has settled.
-     */
-    void read_phase_header_metadata();
-
-    /**
-     * Sum per-core / per-thread total_record_count and dropped_record_count
-     * for both the PERF and PHASE pools, cross-check
-     * `collected + dropped == device_total`, and LOG_ERROR any non-zero
-     * current_buf_ptr (which would indicate a device-side flush failure that
-     * left a buffer un-enqueued — see .claude/rules/discipline.md).
-     * The PHASE block is skipped silently when no phase activity was
-     * recorded (runtimes that don't emit phase records). Must be called
-     * after stop().
-     */
-    void reconcile_counters();
-
-    /**
-     * @return Per-core L2SwimlaneAicpuTaskRecord vectors (indexed by core_index). For tests.
-     */
-    const std::vector<std::vector<L2SwimlaneAicpuTaskRecord>> &get_records() const { return collected_perf_records_; }
-
-private:
-    // Shared memory pointers. shm_host_ / device_id_ live on ProfilerBase
-    // (set via set_memory_context in initialize()).
-    void *perf_shared_mem_dev_{nullptr};
-
-    // Standalone uint64_t[num_aicore] table holding per-core L2SwimlaneAicoreTaskBuffer
-    // addresses. Allocated in initialize(), freed in finalize(). AICore reads
-    // ring_table[block_idx] via KernelArgs::l2_swimlane_aicore_rotation_table.
-    void *aicore_ring_addr_table_dev_{nullptr};
-
-    int num_aicore_{0};
-    // Total AICPU threads launched this run. The dedicated orchestrator runs on
-    // the last one (aicpu_thread_num_ - 1); used to report its thread number in
-    // the phase-metadata log (orch-phase is a single pool, so its index alone
-    // does not encode the AICPU thread).
-    int aicpu_thread_num_{0};
-    L2SwimlaneLevel l2_swimlane_level_{L2SwimlaneLevel::DISABLED};
-
-    // Per-core core_type table populated by set_core_types(). Indexed by
-    // core_id; size matches num_aicore_ once populated. Used by the level=1
-    // emit path which has no AICPU record to read core_type from.
-    std::vector<CoreType> core_types_;
-
-    // Per-task output directory captured at initialize() time. Consumed by
-    // export_swimlane_json() to build <prefix>/l2_swimlane_records.json.
-    std::string output_prefix_;
-
-    // Collected data (per-core vectors, indexed by core_index)
-    std::vector<std::vector<L2SwimlaneAicpuTaskRecord>> collected_perf_records_;
-
-    // Collected AICore records (per-core vectors). Each entry is a full
-    // L2SwimlaneAicoreTaskRecord captured from a rotated L2SwimlaneAicoreTaskBuffer. The
-    // order across rotations is preserved by `copy_aicore_buffer` (we sort
-    // incoming buffers by buffer_seq before flattening).
-    std::vector<std::vector<L2SwimlaneAicoreTaskRecord>> collected_aicore_records_;
-
-    // AICPU phase profiling data — separate per-thread vectors for sched and
-    // orch records (kind-tagged at routing time; no parse-time discrimination).
-    std::vector<std::vector<L2SwimlaneAicpuSchedPhaseRecord>> collected_sched_phase_records_;
-    std::vector<std::vector<L2SwimlaneAicpuOrchPhaseRecord>> collected_orch_phase_records_;
-    std::atomic<bool> has_phase_data_{false};
-
-    // Core-to-thread mapping (core_id → scheduler thread index, -1 = unassigned)
-    std::vector<int8_t> core_to_thread_;
-
-    // Running totals used at reconcile time to cross-check device-side counters.
-    std::atomic<uint64_t> total_perf_collected_{0};
-    std::atomic<uint64_t> total_sched_phase_collected_{0};
-    std::atomic<uint64_t> total_orch_phase_collected_{0};
-
-    std::array<std::mutex, PLATFORM_MAX_CORES> perf_record_mutexes_;
-    std::array<std::mutex, PLATFORM_MAX_CORES> aicore_record_mutexes_;
-    std::array<std::mutex, PLATFORM_MAX_AICPU_THREADS> sched_phase_record_mutexes_;
-    std::array<std::mutex, PLATFORM_MAX_AICPU_THREADS> orch_phase_record_mutexes_;
-
-    // Allocate a single buffer (any of the L2SwimlaneAicpu*Buffer kinds) and register it.
-    // The RAII counterpart ``release_one_buffer`` lives on ProfilerBase and
-    // is shared with every other collector.
-    void *alloc_single_buffer(size_t size, void **host_ptr_out);
-
-    // Per-buffer-kind handlers used by on_buffer_collected.
-    void copy_perf_buffer(const ReadyBufferInfo &info);
-    void copy_sched_phase_buffer(const ReadyBufferInfo &info);
-    void copy_orch_phase_buffer(const ReadyBufferInfo &info);
-    void copy_aicore_buffer(const ReadyBufferInfo &info);
-};
-
-#endif  // SRC_A2A3_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_
diff --git a/src/a2a3/platform/onboard/host/CMakeLists.txt b/src/a2a3/platform/onboard/host/CMakeLists.txt
index 84e6f3f3c..dc4871139 100644
--- a/src/a2a3/platform/onboard/host/CMakeLists.txt
+++ b/src/a2a3/platform/onboard/host/CMakeLists.txt
@@ -57,11 +57,11 @@ list(APPEND HOST_RUNTIME_SOURCES
     "${CMAKE_CURRENT_SOURCE_DIR}/pto_runtime_c_api.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/platform_compile_info.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/host_regs.cpp"
-    "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/l2_swimlane_collector.cpp"
+    "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/l2_swimlane_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/tensor_dump_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/comm_hccl.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/pmu_collector.cpp"
-    "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/dep_gen_collector.cpp"
+    "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/dep_gen_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/scope_stats_collector.cpp"
 )
 # Add common/aicpu_loader/host sources (LoadAicpuOp)
diff --git a/src/a2a3/platform/shared/host/dep_gen_collector.cpp b/src/a2a3/platform/shared/host/dep_gen_collector.cpp
deleted file mode 100644
index af549ccf8..000000000
--- a/src/a2a3/platform/shared/host/dep_gen_collector.cpp
+++ /dev/null
@@ -1,275 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-
-/**
- * @file dep_gen_collector.cpp
- * @brief Host-side dep_gen collector. The mgmt-thread + buffer-pool machinery
- *        lives in profiling_common::BufferPoolManager parameterized by
- *        DepGenModule (host/dep_gen_collector.h); this file owns the
- *        per-buffer on_buffer_collected callback (in-memory append) and the
- *        device-side cross-check. Records stay in ``records_`` and are
- *        consumed directly by the host replay — no on-disk submit_trace.bin
- *        intermediary.
- */
-
-#include "host/dep_gen_collector.h"
-
-#include <cassert>
-#include <cstring>
-#include <unordered_set>
-
-#include "common/memory_barrier.h"
-#include "common/unified_log.h"
-
-DepGenCollector::~DepGenCollector() { stop(); }
-
-// ---------------------------------------------------------------------------
-// init
-// ---------------------------------------------------------------------------
-
-int DepGenCollector::init(
-    int num_threads, const DepGenAllocCallback &alloc_cb, DepGenRegisterCallback register_cb,
-    const DepGenFreeCallback &free_cb, int device_id
-) {
-    if (num_threads <= 0 || alloc_cb == nullptr || free_cb == nullptr) {
-        LOG_ERROR("DepGenCollector::init: invalid arguments");
-        return -1;
-    }
-
-    num_threads_ = num_threads;
-    buffers_registered_ = (register_cb != nullptr);
-    total_collected_ = 0;
-    records_.clear();
-    execution_complete_.store(false, std::memory_order_release);
-
-    // ---- Allocate shared header + buffer-state region ----
-    // dep_gen is single-instance: just one DepGenBufferState after the header.
-    const int num_instances = 1;
-    shm_size_ = calc_dep_gen_shm_size(num_instances);
-    shm_dev_ = alloc_cb(shm_size_);
-    if (shm_dev_ == nullptr) {
-        LOG_ERROR("DepGenCollector: failed to allocate dep_gen shared memory (%zu bytes)", shm_size_);
-        return -1;
-    }
-
-    if (register_cb != nullptr) {
-        int rc = register_cb(shm_dev_, shm_size_, device_id, &shm_host_);
-        if (rc != 0) {
-            LOG_ERROR("DepGenCollector: halHostRegister for dep_gen SHM failed: %d", rc);
-            free_cb(shm_dev_);
-            shm_dev_ = nullptr;
-            return rc;
-        }
-        shm_registered_ = true;
-    } else {
-        shm_host_ = shm_dev_;
-    }
-    std::memset(shm_host_, 0, shm_size_);
-
-    DepGenDataHeader *hdr = get_dep_gen_header(shm_host_);
-    hdr->num_instances = static_cast<uint32_t>(num_instances);
-
-    // ---- Allocate DepGenBuffers, populate free_queue + recycled pool ----
-    const size_t buf_size = sizeof(DepGenBuffer);
-    DepGenBufferState *state = dep_gen_state(0);
-
-    for (int b = 0; b < PLATFORM_DEP_GEN_BUFFERS_PER_INSTANCE; b++) {
-        void *dev_ptr = alloc_cb(buf_size);
-        if (dev_ptr == nullptr) {
-            LOG_ERROR("DepGenCollector: failed to allocate DepGenBuffer b=%d", b);
-            return -1;
-        }
-
-        void *host_ptr = dev_ptr;
-        if (register_cb != nullptr) {
-            int rc = register_cb(dev_ptr, buf_size, device_id, &host_ptr);
-            if (rc != 0) {
-                LOG_ERROR("DepGenCollector: halHostRegister for DepGenBuffer b=%d failed: %d", b, rc);
-                free_cb(dev_ptr);
-                return rc;
-            }
-        }
-        std::memset(host_ptr, 0, buf_size);
-
-        manager_.register_mapping(dev_ptr, host_ptr);
-
-        if (b < PLATFORM_DEP_GEN_SLOT_COUNT) {
-            uint32_t tail = state->free_queue.tail;
-            assert(tail - state->free_queue.head < PLATFORM_DEP_GEN_SLOT_COUNT && "free_queue overflow on init");
-            state->free_queue.buffer_ptrs[tail % PLATFORM_DEP_GEN_SLOT_COUNT] = reinterpret_cast<uint64_t>(dev_ptr);
-            wmb();
-            state->free_queue.tail = tail + 1;
-            wmb();
-        } else {
-            manager_.push_recycled(0, dev_ptr);
-        }
-    }
-
-    initialized_ = true;
-    set_memory_context(
-        alloc_cb, register_cb, free_cb, /*copy_to=*/nullptr, /*copy_from=*/nullptr, shm_dev_, shm_host_, shm_size_,
-        device_id
-    );
-
-    LOG_INFO_V0(
-        "DepGen collector initialized: %d threads, SHM=0x%lx (records held in memory until replay)", num_threads,
-        reinterpret_cast<unsigned long>(shm_dev_)
-    );
-    return 0;
-}
-
-// ---------------------------------------------------------------------------
-// Record accumulation (in-memory — no disk hop)
-// ---------------------------------------------------------------------------
-
-void DepGenCollector::append_buffer_records(const void *buf_host_ptr) {
-    const DepGenBuffer *buf = reinterpret_cast<const DepGenBuffer *>(buf_host_ptr);
-    uint32_t n = buf->count;
-    if (n > static_cast<uint32_t>(PLATFORM_DEP_GEN_RECORDS_PER_BUFFER)) {
-        n = static_cast<uint32_t>(PLATFORM_DEP_GEN_RECORDS_PER_BUFFER);
-    }
-    if (n == 0) return;
-
-    std::scoped_lock lock(records_mutex_);
-    records_.insert(records_.end(), buf->records, buf->records + n);
-    total_collected_ += n;
-}
-
-// ---------------------------------------------------------------------------
-// ProfilerBase callback
-// ---------------------------------------------------------------------------
-
-void DepGenCollector::on_buffer_collected(const DepGenReadyBufferInfo &info) {
-    append_buffer_records(info.host_buffer_ptr);
-}
-
-// ---------------------------------------------------------------------------
-// reconcile_counters
-// ---------------------------------------------------------------------------
-
-bool DepGenCollector::reconcile_counters() {
-    if (shm_host_ == nullptr) return false;
-
-    rmb();
-
-    bool clean = true;
-
-    DepGenBufferState *state = dep_gen_state(0);
-    uint64_t buf_dev = state->current_buf_ptr;
-    if (buf_dev != 0) {
-        void *host_ptr = manager_.resolve_host_ptr(reinterpret_cast<void *>(buf_dev));
-        if (host_ptr != nullptr) {
-            uint32_t count = reinterpret_cast<const DepGenBuffer *>(host_ptr)->count;
-            if (count != 0) {
-                LOG_ERROR(
-                    "dep_gen reconcile: un-flushed buffer (current_buf_ptr=0x%lx, count=%u) — device flush failed",
-                    static_cast<unsigned long>(buf_dev), count
-                );
-                clean = false;
-            }
-        }
-    }
-
-    uint64_t total_device = state->total_record_count;
-    uint64_t dropped_device = state->dropped_record_count;
-    uint64_t overflow_device = state->total_overflow_record_count;
-
-    if (dropped_device > 0) {
-        LOG_WARN(
-            "dep_gen reconcile: %lu records dropped on device side (free_queue empty or ready_queue full). "
-            "Increase PLATFORM_DEP_GEN_BUFFERS_PER_INSTANCE / PLATFORM_DEP_GEN_READYQUEUE_SIZE if frequent. "
-            "deps.json will NOT be emitted for this run (incomplete graph).",
-            static_cast<unsigned long>(dropped_device)
-        );
-        clean = false;
-    }
-    // collected counts physical buffer slots; total_device counts submits; the
-    // chain expands submits into multiple slots, so the overflow counter
-    // bridges the two.
-    if (total_collected_ + dropped_device != total_device + overflow_device) {
-        LOG_WARN(
-            "dep_gen reconcile: record count mismatch (collected=%lu + dropped=%lu != device_total=%lu + "
-            "overflow=%lu, silent_loss=%ld)",
-            static_cast<unsigned long>(total_collected_), static_cast<unsigned long>(dropped_device),
-            static_cast<unsigned long>(total_device), static_cast<unsigned long>(overflow_device),
-            static_cast<long>(total_device + overflow_device) - static_cast<long>(total_collected_ + dropped_device)
-        );
-        clean = false;
-    } else {
-        LOG_INFO_V0(
-            "dep_gen reconcile: counts match (collected=%lu, dropped=%lu, device_total=%lu, overflow=%lu)",
-            static_cast<unsigned long>(total_collected_), static_cast<unsigned long>(dropped_device),
-            static_cast<unsigned long>(total_device), static_cast<unsigned long>(overflow_device)
-        );
-    }
-
-    return clean;
-}
-
-// ---------------------------------------------------------------------------
-// finalize
-// ---------------------------------------------------------------------------
-
-void DepGenCollector::finalize(DepGenUnregisterCallback unregister_cb, const DepGenFreeCallback &free_cb) {
-    if (!initialized_) return;
-
-    stop();
-
-    {
-        std::scoped_lock lock(records_mutex_);
-        records_.clear();
-        records_.shrink_to_fit();
-    }
-
-    // Same pattern as PmuCollector: walk owned buffers, then the free_queue
-    // and current_buf_ptr, releasing each unique device pointer once.
-    auto release_buf = [&](void *p) {
-        release_one_buffer(p, buffers_registered_ ? unregister_cb : nullptr, free_cb);
-    };
-    manager_.release_owned_buffers(release_buf);
-
-    if (shm_host_ != nullptr) {
-        std::unordered_set<void *> already_freed;
-        auto release_unique = [&](void *p) {
-            if (p == nullptr || !already_freed.insert(p).second) return;
-            release_buf(p);
-        };
-        DepGenBufferState *state = dep_gen_state(0);
-        release_unique(reinterpret_cast<void *>(state->current_buf_ptr));
-        state->current_buf_ptr = 0;
-        rmb();
-        uint32_t head = state->free_queue.head;
-        uint32_t tail = state->free_queue.tail;
-        uint32_t queued = tail - head;
-        if (queued > PLATFORM_DEP_GEN_SLOT_COUNT) queued = PLATFORM_DEP_GEN_SLOT_COUNT;
-        for (uint32_t i = 0; i < queued; i++) {
-            uint32_t slot = (head + i) % PLATFORM_DEP_GEN_SLOT_COUNT;
-            release_unique(reinterpret_cast<void *>(state->free_queue.buffer_ptrs[slot]));
-            state->free_queue.buffer_ptrs[slot] = 0;
-        }
-        state->free_queue.head = tail;
-    }
-    manager_.clear_mappings();
-
-    if (shm_dev_ != nullptr) {
-        release_one_buffer(shm_dev_, shm_registered_ ? unregister_cb : nullptr, free_cb);
-        shm_dev_ = nullptr;
-        shm_host_ = nullptr;
-    }
-
-    initialized_ = false;
-    buffers_registered_ = false;
-    shm_registered_ = false;
-    shm_size_ = 0;
-    total_collected_ = 0;
-    clear_memory_context();
-    LOG_INFO_V0("DepGen collector finalized");
-}
diff --git a/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp b/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
deleted file mode 100644
index 6572a498b..000000000
--- a/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
+++ /dev/null
@@ -1,1036 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-
-/**
- * @file l2_swimlane_collector.cpp
- * @brief Performance data collector implementation. The mgmt-thread + buffer-pool
- *        machinery lives in profiling_common::BufferPoolManager parameterized by
- *        L2SwimlaneModule (host/l2_swimlane_collector.h); the poll loop lives in
- *        profiling_common::ProfilerBase. This file owns the per-buffer
- *        on_buffer_collected callback and the export logic.
- */
-
-#include "host/l2_swimlane_collector.h"
-
-#include <cinttypes>
-#include <cstdlib>
-#include <ctime>
-#include <filesystem>
-#include <fstream>
-#include <string>
-#include <vector>
-
-#include "common/memory_barrier.h"
-#include "common/unified_log.h"
-
-// =============================================================================
-// L2SwimlaneCollector Implementation
-// =============================================================================
-
-// Sched / orch phase records route through separate BufferKinds; no
-// parse-time discriminator function is needed (the device-side type tag is
-// the source of truth).
-
-L2SwimlaneCollector::~L2SwimlaneCollector() {
-    stop();
-    if (shm_host_ != nullptr) {
-        LOG_WARN("L2SwimlaneCollector destroyed without finalize()");
-    }
-}
-
-void *L2SwimlaneCollector::alloc_single_buffer(size_t size, void **host_ptr_out) {
-    void *dev_ptr = alloc_cb_(size);
-    if (dev_ptr == nullptr) {
-        LOG_ERROR("Failed to allocate buffer (%zu bytes)", size);
-        *host_ptr_out = nullptr;
-        return nullptr;
-    }
-
-    if (register_cb_ != nullptr) {
-        void *host_ptr = nullptr;
-        int rc = register_cb_(dev_ptr, size, device_id_, &host_ptr);
-        if (rc != 0 || host_ptr == nullptr) {
-            LOG_ERROR("Buffer registration failed: %d", rc);
-            *host_ptr_out = nullptr;
-            return nullptr;
-        }
-        *host_ptr_out = host_ptr;
-    } else {
-        *host_ptr_out = dev_ptr;
-    }
-
-    // Register mapping so the BufferPoolManager can resolve dev→host
-    manager_.register_mapping(dev_ptr, *host_ptr_out);
-    return dev_ptr;
-}
-
-int L2SwimlaneCollector::initialize(
-    int num_aicore, int aicpu_thread_num, int device_id, L2SwimlaneLevel l2_swimlane_level,
-    const L2SwimlaneAllocCallback &alloc_cb, L2SwimlaneRegisterCallback register_cb,
-    const L2SwimlaneFreeCallback &free_cb, const std::string &output_prefix
-) {
-    if (shm_host_ != nullptr) {
-        LOG_ERROR("L2SwimlaneCollector already initialized");
-        return -1;
-    }
-
-    LOG_INFO_V0("Initializing performance profiling");
-
-    if (num_aicore <= 0 || num_aicore > PLATFORM_MAX_CORES) {
-        LOG_ERROR("Invalid number of AICores: %d (max=%d)", num_aicore, PLATFORM_MAX_CORES);
-        return -1;
-    }
-
-    num_aicore_ = num_aicore;
-    aicpu_thread_num_ = aicpu_thread_num;
-    l2_swimlane_level_ = l2_swimlane_level;
-    output_prefix_ = output_prefix;
-    total_perf_collected_.store(0, std::memory_order_relaxed);
-    total_sched_phase_collected_.store(0, std::memory_order_relaxed);
-    total_orch_phase_collected_.store(0, std::memory_order_relaxed);
-
-    // Stash the memory context on the base up-front so alloc_single_buffer
-    // sees consistent values during init. shm_host_ stays nullptr until the
-    // shm allocation succeeds — the nullptr guard makes a post-failure
-    // start(tf) a no-op.
-    set_memory_context(
-        alloc_cb, register_cb, free_cb, /*copy_to=*/nullptr, /*copy_from=*/nullptr, /*shm_dev=*/nullptr,
-        /*shm_host=*/nullptr, /*shm_size=*/0, device_id
-    );
-
-    // Step 1: Calculate shared memory size (slot arrays only, no actual
-    // buffers). Host over-allocates phase pool slots to the platform max for
-    // both sched and orch — AICPU picks the actual counts at init_phase time
-    // and writes them into the header.
-    int num_phase_threads = PLATFORM_MAX_AICPU_THREADS;
-    size_t total_size = calc_perf_data_size_with_phases(num_aicore, num_phase_threads, num_phase_threads);
-
-    LOG_DEBUG("Shared memory allocation plan:");
-    LOG_DEBUG("  Number of cores:      %d", num_aicore);
-    LOG_DEBUG("  Header size:          %zu bytes", sizeof(L2SwimlaneDataHeader));
-    LOG_DEBUG("  L2SwimlaneAicpuTaskPool size: %zu bytes each", sizeof(L2SwimlaneAicpuTaskPool));
-    LOG_DEBUG("  L2SwimlaneAicpuSchedPhasePool size: %zu bytes each", sizeof(L2SwimlaneAicpuSchedPhasePool));
-    LOG_DEBUG("  L2SwimlaneAicpuOrchPhasePool size:  %zu bytes each", sizeof(L2SwimlaneAicpuOrchPhasePool));
-    LOG_DEBUG("  Total shared memory:  %zu bytes (%zu KB)", total_size, total_size / 1024);
-
-    // Step 2: Allocate shared memory for slot arrays
-    void *perf_dev_ptr = alloc_cb(total_size);
-    if (perf_dev_ptr == nullptr) {
-        LOG_ERROR("Failed to allocate shared memory (%zu bytes)", total_size);
-        return -1;
-    }
-    LOG_DEBUG("Allocated shared memory: %p", perf_dev_ptr);
-
-    // Step 3: Register to host mapping (optional)
-    void *perf_host_ptr = nullptr;
-    if (register_cb != nullptr) {
-        int rc = register_cb(perf_dev_ptr, total_size, device_id, &perf_host_ptr);
-        if (rc != 0) {
-            LOG_ERROR("Memory registration failed: %d", rc);
-            return rc;
-        }
-        if (perf_host_ptr == nullptr) {
-            LOG_ERROR("register_cb succeeded but returned null host_ptr");
-            return -1;
-        }
-        LOG_DEBUG("Mapped to host memory: %p", perf_host_ptr);
-    } else {
-        perf_host_ptr = perf_dev_ptr;
-        LOG_DEBUG("Simulation mode: host_ptr = dev_ptr = %p", perf_host_ptr);
-    }
-
-    // Step 4: Initialize header
-    L2SwimlaneDataHeader *header = get_l2_swimlane_header(perf_host_ptr);
-
-    for (int t = 0; t < PLATFORM_MAX_AICPU_THREADS; t++) {
-        memset(header->queues[t], 0, sizeof(header->queues[t]));
-        header->queue_heads[t] = 0;
-        header->queue_tails[t] = 0;
-    }
-
-    header->num_cores = num_aicore;
-    header->l2_swimlane_level = static_cast<uint32_t>(l2_swimlane_level_);
-    // Phase metadata: must be zero-initialized here. alloc_cb returns
-    // uninitialized device memory; AICPU only writes these fields when
-    // phase init runs (level >= SCHED_PHASES). Without zeroing, lower
-    // levels (AICORE_TIMING / AICPU_TIMING) leave garbage that
-    // for_each_instance iterates as `num_sched_phase_threads` /
-    // `num_orch_phase_threads`, walking off the end of the allocated pool
-    // array → segfault. The host-side reader (read_phase_header_metadata)
-    // and BufferPoolManager replenish loop both gate on these counts being
-    // sane values.
-    header->num_sched_phase_threads = 0;
-    header->num_orch_phase_threads = 0;
-    header->num_phase_cores = 0;
-    memset(header->core_to_thread, -1, sizeof(header->core_to_thread));
-
-    LOG_DEBUG("Initialized L2SwimlaneDataHeader:");
-    LOG_DEBUG("  num_cores:              %d", header->num_cores);
-    LOG_DEBUG("  l2_swimlane_level: %u", header->l2_swimlane_level);
-    LOG_DEBUG("  buffer_capacity:        %d", PLATFORM_PROF_BUFFER_SIZE);
-    LOG_DEBUG("  queue capacity:         %d", PLATFORM_PROF_READYQUEUE_SIZE);
-
-    // Step 5: Initialize L2SwimlaneAicpuTaskPools. Seed as many buffers as
-    // the device-side free_queue can hold; any remaining buffers stay in the
-    // host recycled pool.
-    for (int i = 0; i < num_aicore; i++) {
-        L2SwimlaneAicpuTaskPool *state = get_perf_buffer_state(perf_host_ptr, i);
-        memset(state, 0, sizeof(L2SwimlaneAicpuTaskPool));
-
-        state->free_queue.head = 0;
-        state->free_queue.tail = 0;
-        state->head.current_buf_ptr = 0;
-        state->head.current_buf_seq = 0;
-
-        const int initial_free_count = (PLATFORM_PROF_BUFFERS_PER_CORE < PLATFORM_PROF_SLOT_COUNT) ?
-                                           PLATFORM_PROF_BUFFERS_PER_CORE :
-                                           PLATFORM_PROF_SLOT_COUNT;
-        for (int s = 0; s < PLATFORM_PROF_BUFFERS_PER_CORE; s++) {
-            void *host_buf_ptr = nullptr;
-            void *dev_buf_ptr = alloc_single_buffer(sizeof(L2SwimlaneAicpuTaskBuffer), &host_buf_ptr);
-            if (dev_buf_ptr == nullptr) {
-                LOG_ERROR("Failed to allocate L2SwimlaneAicpuTaskBuffer for core %d, buffer %d", i, s);
-                return -1;
-            }
-            L2SwimlaneAicpuTaskBuffer *buf = reinterpret_cast<L2SwimlaneAicpuTaskBuffer *>(host_buf_ptr);
-            memset(buf, 0, sizeof(L2SwimlaneAicpuTaskBuffer));
-            buf->count = 0;
-
-            if (s < initial_free_count) {
-                state->free_queue.buffer_ptrs[s] = reinterpret_cast<uint64_t>(dev_buf_ptr);
-            } else {
-                manager_.push_recycled(static_cast<int>(ProfBufferType::AICPU_TASK), dev_buf_ptr);
-            }
-        }
-        wmb();
-        state->free_queue.tail = static_cast<uint32_t>(initial_free_count);
-        wmb();
-    }
-
-    // Step 5b: Initialize L2SwimlaneAicoreTaskPools — per-core AICore rotation
-    // channel + buffer pool. Same SPSC pattern as the AICPU pool above.
-    for (int i = 0; i < num_aicore; i++) {
-        L2SwimlaneAicoreTaskPool *ac_state = get_aicore_buffer_state(perf_host_ptr, num_aicore, i);
-        memset(ac_state, 0, sizeof(L2SwimlaneAicoreTaskPool));
-
-        const int initial_free_count = (PLATFORM_AICORE_BUFFERS_PER_CORE < PLATFORM_PROF_SLOT_COUNT) ?
-                                           PLATFORM_AICORE_BUFFERS_PER_CORE :
-                                           PLATFORM_PROF_SLOT_COUNT;
-        for (int s = 0; s < PLATFORM_AICORE_BUFFERS_PER_CORE; s++) {
-            void *host_buf_ptr = nullptr;
-            void *dev_buf_ptr = alloc_single_buffer(sizeof(L2SwimlaneAicoreTaskBuffer), &host_buf_ptr);
-            if (dev_buf_ptr == nullptr) {
-                LOG_ERROR("Failed to allocate L2SwimlaneAicoreTaskBuffer for core %d, buffer %d", i, s);
-                return -1;
-            }
-            L2SwimlaneAicoreTaskBuffer *buf = reinterpret_cast<L2SwimlaneAicoreTaskBuffer *>(host_buf_ptr);
-            memset(buf, 0, sizeof(L2SwimlaneAicoreTaskBuffer));
-            buf->count = 0;
-
-            if (s < initial_free_count) {
-                ac_state->free_queue.buffer_ptrs[s] = reinterpret_cast<uint64_t>(dev_buf_ptr);
-            } else {
-                manager_.push_recycled(static_cast<int>(ProfBufferType::AICORE_TASK), dev_buf_ptr);
-            }
-        }
-        wmb();
-        ac_state->free_queue.tail = static_cast<uint32_t>(initial_free_count);
-        wmb();
-    }
-    LOG_DEBUG(
-        "Initialized buffer pools: %d L2SwimlaneAicpuTaskBuffers/core + %d L2SwimlaneAicoreTaskBuffers/core (up to "
-        "%d in free_queue, rest in recycled pool)",
-        PLATFORM_PROF_BUFFERS_PER_CORE, PLATFORM_AICORE_BUFFERS_PER_CORE, PLATFORM_PROF_SLOT_COUNT
-    );
-
-    // Step 5c: Standalone uint64_t[num_aicore] table that will hold per-core
-    // L2SwimlaneActiveHead device addresses. Host only allocates the bytes and
-    // hands the device pointer to AICPU via KernelArgs::l2_swimlane_aicore_rotation_table;
-    // AICPU itself fills the entries inside `l2_swimlane_aicpu_init` (it has
-    // direct access to `&ac_state->head` device addresses, no
-    // host-to-device translation needed). AICore reads
-    // rotation_table[block_idx] at kernel entry.
-    {
-        size_t table_bytes = static_cast<size_t>(num_aicore) * sizeof(uint64_t);
-        void *rotation_table_host = nullptr;
-        void *rotation_table_dev = alloc_single_buffer(table_bytes, &rotation_table_host);
-        if (rotation_table_dev == nullptr) {
-            LOG_ERROR("Failed to allocate l2_swimlane_aicore_rotation_table (%zu bytes)", table_bytes);
-            return -1;
-        }
-        aicore_ring_addr_table_dev_ = rotation_table_dev;
-    }
-
-    // Step 6: Initialize per-thread phase pools — both sched and orch. Each
-    // pool is sized to its own PLATFORM_PROF_{SCHED,ORCH}_BUFFERS_PER_THREAD
-    // (seeded into free_queue up to slot capacity, rest in the recycled pool
-    // tagged by kind). Templated on the concrete TypedBuffer so the `count`
-    // zero-store uses the matching layout — sched and orch buffers have
-    // DIFFERENT sizes (64B vs 32B records), so a single cast type for both
-    // would land the count store past the end of the orch allocation and
-    // corrupt the heap.
-    // state_count pool states are zeroed (so the host's [0, PLATFORM_MAX)
-    // reconcile/iteration reads count=0 for unused slots); buffers are
-    // allocated only for the first buffer_count pools. For sched the two are
-    // equal; orch is a single instance (pool 0), so it zeroes all slots but
-    // allocates buffers for just pool 0 — no buffers wasted on unused slots.
-    auto init_phase_pools = [&](auto buffer_tag, L2SwimlaneAicpuTaskPool *(*get_state)(void *, int, int),
-                                int state_count, int buffer_count, int buffers_per_thread, ProfBufferType recycle_kind,
-                                const char *kind_label) -> int {
-        using Buffer = typename decltype(buffer_tag)::type;
-        constexpr size_t buffer_bytes = sizeof(Buffer);
-        for (int t = 0; t < state_count; t++) {
-            auto *state = get_state(perf_host_ptr, num_aicore, t);
-            memset(state, 0, sizeof(L2SwimlaneAicpuTaskPool));
-            if (t >= buffer_count) continue;  // zeroed state only; no buffers (unused slot)
-            const int initial_free_count =
-                (buffers_per_thread < PLATFORM_PROF_SLOT_COUNT) ? buffers_per_thread : PLATFORM_PROF_SLOT_COUNT;
-            for (int s = 0; s < buffers_per_thread; s++) {
-                void *host_buf_ptr = nullptr;
-                void *dev_buf_ptr = alloc_single_buffer(buffer_bytes, &host_buf_ptr);
-                if (dev_buf_ptr == nullptr) {
-                    LOG_ERROR("Failed to allocate %s phase buffer for thread %d, slot %d", kind_label, t, s);
-                    return -1;
-                }
-                // Zero only the `count` word at the buffer's tail, using the
-                // matching Buffer type. The records payload is overwritten by
-                // AICPU on first use.
-                reinterpret_cast<Buffer *>(host_buf_ptr)->count = 0;
-                if (s < initial_free_count) {
-                    state->free_queue.buffer_ptrs[s] = reinterpret_cast<uint64_t>(dev_buf_ptr);
-                } else {
-                    manager_.push_recycled(static_cast<int>(recycle_kind), dev_buf_ptr);
-                }
-            }
-            wmb();
-            state->free_queue.tail = static_cast<uint32_t>(initial_free_count);
-            wmb();
-        }
-        return 0;
-    };
-
-    // Type tags so the templated lambda can deduce the buffer type without
-    // having to spell out an explicit template argument (not portable on a
-    // generic lambda before C++20 explicit template-parameter syntax).
-    struct SchedTag {
-        using type = L2SwimlaneAicpuSchedPhaseBuffer;
-    };
-    struct OrchTag {
-        using type = L2SwimlaneAicpuOrchPhaseBuffer;
-    };
-
-    // Sched: actual scheduler-thread count is unknown at host-alloc time, so
-    // size buffers to the platform max. Orch: a single instance (pool 0), so
-    // allocate buffers for just one pool while still zeroing all MAX states.
-    if (init_phase_pools(
-            SchedTag{}, get_sched_phase_buffer_state, /*state_count=*/num_phase_threads,
-            /*buffer_count=*/num_phase_threads, /*buffers_per_thread=*/PLATFORM_PROF_SCHED_BUFFERS_PER_THREAD,
-            ProfBufferType::AICPU_SCHED_PHASE, "sched"
-        ) != 0) {
-        return -1;
-    }
-    auto orch_get_state = [](void *base, int n_cores, int t) {
-        return get_orch_phase_buffer_state(base, n_cores, t);
-    };
-    if (init_phase_pools(
-            OrchTag{}, orch_get_state, /*state_count=*/num_phase_threads, /*buffer_count=*/1,
-            /*buffers_per_thread=*/PLATFORM_PROF_ORCH_BUFFERS_PER_THREAD, ProfBufferType::AICPU_ORCH_PHASE, "orch"
-        ) != 0) {
-        return -1;
-    }
-    LOG_DEBUG(
-        "Initialized %d sched (%d buf/thread) + 1 orch (%d buf) PhaseBufferStates (seeded up to %d free_queue "
-        "slots)",
-        num_phase_threads, PLATFORM_PROF_SCHED_BUFFERS_PER_THREAD, PLATFORM_PROF_ORCH_BUFFERS_PER_THREAD,
-        PLATFORM_PROF_SLOT_COUNT
-    );
-
-    wmb();
-
-    // Step 7: Stash device pointer for the caller to publish via
-    // kernel_args.l2_swimlane_data_base (read back via get_l2_swimlane_setup_device_ptr()).
-    LOG_DEBUG("L2 swimlane device base = 0x%lx", static_cast<unsigned long>(reinterpret_cast<uintptr_t>(perf_dev_ptr)));
-
-    perf_shared_mem_dev_ = perf_dev_ptr;
-    // Refresh memory context with the now-known SHM tuple. start(tf) (inherited)
-    // gates on shm_host_, so this is the moment the collector becomes startable.
-    set_memory_context(
-        alloc_cb, register_cb, free_cb, /*copy_to=*/nullptr, /*copy_from=*/nullptr, perf_dev_ptr, perf_host_ptr,
-        total_size, device_id
-    );
-
-    collected_perf_records_.assign(num_aicore_, {});
-    collected_aicore_records_.assign(num_aicore_, {});
-    collected_sched_phase_records_.assign(PLATFORM_MAX_AICPU_THREADS, {});
-    collected_orch_phase_records_.assign(PLATFORM_MAX_AICPU_THREADS, {});
-
-    LOG_INFO_V0("Performance profiling initialized (dynamic buffer mode)");
-    return 0;
-}
-
-// ---------------------------------------------------------------------------
-// ProfilerBase callbacks
-// ---------------------------------------------------------------------------
-
-void L2SwimlaneCollector::copy_perf_buffer(const ReadyBufferInfo &info) {
-    L2SwimlaneAicpuTaskBuffer *buf = reinterpret_cast<L2SwimlaneAicpuTaskBuffer *>(info.host_buffer_ptr);
-    rmb();
-    uint32_t count = buf->count;
-    if (count > PLATFORM_PROF_BUFFER_SIZE) {
-        count = PLATFORM_PROF_BUFFER_SIZE;
-    }
-    uint32_t core_index = info.index;
-    if (core_index < static_cast<uint32_t>(num_aicore_)) {
-        std::scoped_lock<std::mutex> lock(perf_record_mutexes_[core_index]);
-        for (uint32_t i = 0; i < count; i++) {
-            collected_perf_records_[core_index].push_back(buf->records[i]);
-        }
-        total_perf_collected_.fetch_add(count, std::memory_order_relaxed);
-    }
-}
-
-void L2SwimlaneCollector::copy_sched_phase_buffer(const ReadyBufferInfo &info) {
-    auto *buf = reinterpret_cast<L2SwimlaneAicpuSchedPhaseBuffer *>(info.host_buffer_ptr);
-    rmb();
-    uint32_t count = buf->count;
-    if (count > static_cast<uint32_t>(PLATFORM_PHASE_RECORDS_PER_THREAD)) {
-        count = PLATFORM_PHASE_RECORDS_PER_THREAD;
-    }
-    uint32_t tidx = info.index;
-    if (tidx < collected_sched_phase_records_.size()) {
-        std::scoped_lock<std::mutex> lock(sched_phase_record_mutexes_[tidx]);
-        for (uint32_t i = 0; i < count; i++) {
-            collected_sched_phase_records_[tidx].push_back(buf->records[i]);
-        }
-        total_sched_phase_collected_.fetch_add(count, std::memory_order_relaxed);
-        if (count > 0) {
-            has_phase_data_.store(true, std::memory_order_relaxed);
-        }
-    }
-}
-
-void L2SwimlaneCollector::copy_orch_phase_buffer(const ReadyBufferInfo &info) {
-    auto *buf = reinterpret_cast<L2SwimlaneAicpuOrchPhaseBuffer *>(info.host_buffer_ptr);
-    rmb();
-    uint32_t count = buf->count;
-    if (count > static_cast<uint32_t>(PLATFORM_PHASE_RECORDS_PER_THREAD)) {
-        count = PLATFORM_PHASE_RECORDS_PER_THREAD;
-    }
-    uint32_t tidx = info.index;
-    if (tidx < collected_orch_phase_records_.size()) {
-        std::scoped_lock<std::mutex> lock(orch_phase_record_mutexes_[tidx]);
-        for (uint32_t i = 0; i < count; i++) {
-            collected_orch_phase_records_[tidx].push_back(buf->records[i]);
-        }
-        total_orch_phase_collected_.fetch_add(count, std::memory_order_relaxed);
-        if (count > 0) {
-            has_phase_data_.store(true, std::memory_order_relaxed);
-        }
-    }
-}
-
-// AICore record buffers arrive on the ready queue in per-core rotation order
-// (AICPU enqueues them at PLATFORM_AICORE_BUFFER_SIZE dispatch boundaries +
-// once at flush). Within a single buffer, AICore wrote records[0..buf->count)
-// in the order tasks ran on that core (completion-before-dispatch invariant
-// + AICPU stamps buf->count just before enqueue). Flattening in arrival
-// order gives us the per-core task stream that join_aicore_records()
-// indexes by reg_task_id.
-//
-// Defensive filter: skip records whose `start_time == 0`. AICore writes
-// `get_sys_cnt_aicore()` (a free-running cycle counter, always non-zero in
-// practice) at task end, so a zero start_time means the slot was never
-// written by AICore for this session. This handles two edge cases without
-// special-casing them:
-//   - Recycled buffer where AICore wrote fewer records than the count stamp
-//     (e.g., the rare dispatch-boundary race for sub-microsecond kernels
-//     where AICore's next record_task fires before AICPU's rotation has
-//     propagated). The "missing" slot's previous contents are zero because
-//     each buffer is memset to zero right after alloc_single_buffer returns.
-//   - Flush-path partial buffer whose tail wasn't reached.
-void L2SwimlaneCollector::copy_aicore_buffer(const ReadyBufferInfo &info) {
-    L2SwimlaneAicoreTaskBuffer *buf = reinterpret_cast<L2SwimlaneAicoreTaskBuffer *>(info.host_buffer_ptr);
-    rmb();
-    uint32_t core_index = info.index;
-    if (core_index >= static_cast<uint32_t>(num_aicore_)) {
-        return;
-    }
-    uint32_t count = buf->count;
-    if (count > static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE)) {
-        count = PLATFORM_AICORE_BUFFER_SIZE;
-    }
-    uint32_t skipped = 0;
-    {
-        std::scoped_lock<std::mutex> lock(aicore_record_mutexes_[core_index]);
-        auto &dst = collected_aicore_records_[core_index];
-        dst.reserve(dst.size() + count);
-        for (uint32_t i = 0; i < count; i++) {
-            const L2SwimlaneAicoreTaskRecord &r = buf->records[i];
-            if (r.start_time == 0) {
-                skipped++;
-                continue;
-            }
-            dst.push_back(r);
-        }
-    }
-    if (skipped > 0) {
-        LOG_WARN(
-            "Core %u: skipped %u AICore record slot(s) with start_time=0 (race-window write or "
-            "recycled-buffer tail). buf seq=%u count=%u",
-            core_index, skipped, info.buffer_seq, count
-        );
-    }
-}
-
-void L2SwimlaneCollector::on_buffer_collected(const ReadyBufferInfo &info) {
-    switch (info.type) {
-    case ProfBufferType::AICPU_TASK:
-        copy_perf_buffer(info);
-        break;
-    case ProfBufferType::AICPU_SCHED_PHASE:
-        copy_sched_phase_buffer(info);
-        break;
-    case ProfBufferType::AICPU_ORCH_PHASE:
-        copy_orch_phase_buffer(info);
-        break;
-    case ProfBufferType::AICORE_TASK:
-        copy_aicore_buffer(info);
-        break;
-    }
-}
-
-// ---------------------------------------------------------------------------
-// reconcile_counters / read_phase_header_metadata
-// ---------------------------------------------------------------------------
-//
-// Host never recovers records from device-side current_buf_ptr. Device flush
-// is the only data path: a flush failure must bump dropped_record_count and
-// clear current_buf_ptr on the device side. Host's job here is purely
-// accounting + sanity check.
-
-void L2SwimlaneCollector::reconcile_counters() {
-    if (shm_host_ == nullptr) {
-        return;
-    }
-
-    rmb();
-
-    // Two-bucket invariant (post-AICore-as-producer): every commit attempt
-    // bumps total_record_count; capacity-driven drops (no free buffer /
-    // queue full / flush failure) bump dropped_record_count.
-    //   silent_loss = device_total - (collected + dropped)
-    // and any non-zero silent loss flags an unaccounted gap on top of the
-    // already-classified dropped losses.
-    //
-    // Sanity sub-check: after stop(), any active buffer with records must
-    // have been flushed by AICPU (success → current_buf_ptr=0; failure →
-    // bump dropped, clear count + current_buf_ptr). A non-zero pointer with
-    // non-zero count means records AICPU neither delivered nor accounted
-    // for — i.e. a device-side flush bug. Empty buffers (count=0, never
-    // written) are fine; AICPU's flush legitimately skips them.
-    auto reconcile_one = [&](const char *kind, const char *unit_name, int unit_count, auto get_state,
-                             auto read_buf_count, uint64_t collected, bool optional) {
-        int leftover_active = 0;
-        for (int i = 0; i < unit_count; i++) {
-            L2SwimlaneAicpuTaskPool *state = get_state(i);
-            uint64_t buf_ptr = state->head.current_buf_ptr;
-            if (buf_ptr == 0) continue;
-            void *host_ptr = manager_.resolve_host_ptr(reinterpret_cast<void *>(buf_ptr));
-            if (host_ptr == nullptr) continue;
-            uint32_t count = read_buf_count(host_ptr);
-            if (count == 0) continue;
-            LOG_ERROR(
-                "L2Swimlane reconcile: %s %d has un-flushed %s buffer (current_buf_ptr=0x%lx, count=%u) "
-                "after stop() — device flush failed",
-                unit_name, i, kind, static_cast<unsigned long>(buf_ptr), count
-            );
-            leftover_active++;
-        }
-
-        uint64_t total_device = 0;
-        uint64_t dropped_device = 0;
-        for (int i = 0; i < unit_count; i++) {
-            L2SwimlaneAicpuTaskPool *state = get_state(i);
-            total_device += state->head.total_record_count;
-            dropped_device += state->head.dropped_record_count;
-        }
-
-        // PHASE counters are populated only by runtimes that actually emit
-        // phase records; skip the comparison entirely when nothing happened.
-        if (optional && total_device == 0 && collected == 0 && dropped_device == 0) {
-            return;
-        }
-
-        if (dropped_device > 0) {
-            LOG_WARN(
-                "L2Swimlane reconcile: %lu %s records dropped on device side.",
-                static_cast<unsigned long>(dropped_device), kind
-            );
-        }
-        uint64_t accounted = collected + dropped_device;
-        if (accounted != total_device) {
-            LOG_WARN(
-                "L2Swimlane reconcile: %s count mismatch (collected=%lu + dropped=%lu != "
-                "device_total=%lu, silent_loss=%ld)",
-                kind, static_cast<unsigned long>(collected), static_cast<unsigned long>(dropped_device),
-                static_cast<unsigned long>(total_device), static_cast<long>(total_device) - static_cast<long>(accounted)
-            );
-        } else {
-            LOG_INFO_V0(
-                "L2Swimlane reconcile: %s counts match (collected=%lu, dropped=%lu, device_total=%lu)", kind,
-                static_cast<unsigned long>(collected), static_cast<unsigned long>(dropped_device),
-                static_cast<unsigned long>(total_device)
-            );
-        }
-
-        if (leftover_active > 0) {
-            LOG_ERROR(
-                "L2Swimlane reconcile: %d %s(s) had un-cleared %s current_buf_ptr — see prior errors", leftover_active,
-                unit_name, kind
-            );
-        }
-    };
-
-    reconcile_one(
-        "PERF", "core", num_aicore_,
-        [this](int core_index) {
-            return get_perf_buffer_state(shm_host_, core_index);
-        },
-        [](void *host_ptr) {
-            return reinterpret_cast<L2SwimlaneAicpuTaskBuffer *>(host_ptr)->count;
-        },
-        total_perf_collected_.load(std::memory_order_relaxed), /*optional=*/false
-    );
-
-    reconcile_one(
-        "SCHED_PHASE", "thread", PLATFORM_MAX_AICPU_THREADS,
-        [this](int thread_index) {
-            return get_sched_phase_buffer_state(shm_host_, num_aicore_, thread_index);
-        },
-        [](void *host_ptr) {
-            return reinterpret_cast<L2SwimlaneAicpuSchedPhaseBuffer *>(host_ptr)->count;
-        },
-        total_sched_phase_collected_.load(std::memory_order_relaxed), /*optional=*/true
-    );
-
-    reconcile_one(
-        "ORCH_PHASE", "thread", PLATFORM_MAX_AICPU_THREADS,
-        [this](int thread_index) {
-            return get_orch_phase_buffer_state(shm_host_, num_aicore_, thread_index);
-        },
-        [](void *host_ptr) {
-            return reinterpret_cast<L2SwimlaneAicpuOrchPhaseBuffer *>(host_ptr)->count;
-        },
-        total_orch_phase_collected_.load(std::memory_order_relaxed), /*optional=*/true
-    );
-}
-
-void L2SwimlaneCollector::read_phase_header_metadata() {
-    if (shm_host_ == nullptr) {
-        return;
-    }
-
-    rmb();
-
-    L2SwimlaneDataHeader *header = get_l2_swimlane_header(shm_host_);
-
-    int num_sched = static_cast<int>(header->num_sched_phase_threads);
-    int num_orch = static_cast<int>(header->num_orch_phase_threads);
-    if (num_sched == 0 && num_orch == 0) {
-        LOG_INFO_V0("No phase profiling data found (sched/orch phase thread counts both 0; phase init never ran)");
-        return;
-    }
-    if (num_sched > PLATFORM_MAX_AICPU_THREADS || num_orch > PLATFORM_MAX_AICPU_THREADS) {
-        LOG_ERROR(
-            "Invalid phase thread counts from shared memory (sched=%d, orch=%d, max=%d)", num_sched, num_orch,
-            PLATFORM_MAX_AICPU_THREADS
-        );
-        return;
-    }
-    // Scheduler threads occupy AICPU threads [0, num_sched); the dedicated
-    // orchestrator runs on the last AICPU thread (aicpu_thread_num_ - 1). The
-    // orch-phase pool is a single instance, so its pool index does not encode
-    // the AICPU thread — derive the thread number from aicpu_thread_num_.
-    // aicpu_thread_num_ is >= 1 (DeviceRunner::run validates launch_aicpu_num in
-    // [1, PLATFORM_MAX_AICPU_THREADS] before initialize()), so the subtraction
-    // can't go negative. This is a log-only display value, never an index.
-    const int orch_thread = aicpu_thread_num_ - 1;
-    LOG_INFO_V0(
-        "Collecting phase metadata: scheduler threads 0-%d, orchestrator thread %d", num_sched - 1, orch_thread
-    );
-
-    for (size_t t = 0; t < collected_sched_phase_records_.size(); t++) {
-        if (!collected_sched_phase_records_[t].empty()) {
-            LOG_INFO_V0("  Sched thread %zu: %zu records", t, collected_sched_phase_records_[t].size());
-        }
-    }
-    for (size_t t = 0; t < collected_orch_phase_records_.size(); t++) {
-        if (!collected_orch_phase_records_[t].empty()) {
-            LOG_INFO_V0("  Orch thread %d: %zu records", orch_thread, collected_orch_phase_records_[t].size());
-        }
-    }
-
-    // has_phase_data_ is set by copy_sched_phase_buffer / copy_orch_phase_buffer
-    // during the drain: every push goes through those call sites, which set
-    // the flag once a buffer carries at least one record. No re-scan needed here.
-
-    // Core-to-thread mapping (header-resident; not buffered).
-    int num_phase_cores = static_cast<int>(header->num_phase_cores);
-    if (num_phase_cores > 0 && num_phase_cores <= PLATFORM_MAX_CORES) {
-        core_to_thread_.assign(header->core_to_thread, header->core_to_thread + num_phase_cores);
-        LOG_INFO_V0("  Core-to-thread mapping: %d cores", num_phase_cores);
-    }
-
-    LOG_INFO_V0(
-        "Phase metadata collection complete: has_phase_data=%s",
-        has_phase_data_.load(std::memory_order_relaxed) ? "yes" : "no"
-    );
-}
-
-void L2SwimlaneCollector::set_core_types(const CoreType *types, int n) {
-    if (types == nullptr || n <= 0) {
-        core_types_.clear();
-        return;
-    }
-    core_types_.assign(types, types + n);
-}
-
-// JSON v2 emit: the host now dumps raw cycle-domain per-stream records plus
-// metadata, and `swimlane_converter.py` performs the join (AICore↔AICPU on
-// reg_task_id, base_time normalization, cycles→µs conversion, sort, core_type
-// lookup, func_id resolution against deps.json). Moving the join into Python
-// makes the schema easy to evolve without round-tripping through C++ + a
-// rebuild, and shrinks this file to a pure dump.
-int L2SwimlaneCollector::export_swimlane_json() {
-    if (shm_host_ == nullptr) {
-        return -1;
-    }
-
-    // Empty-export guard: nothing useful on disk if every per-stream source is
-    // empty. AICPU_TIMING+ relies on `collected_perf_records_`; AICORE_TIMING
-    // (level=1) relies on `collected_aicore_records_` alone.
-    bool has_any_records = false;
-    for (const auto &core_records : collected_perf_records_) {
-        if (!core_records.empty()) {
-            has_any_records = true;
-            break;
-        }
-    }
-    if (!has_any_records) {
-        for (const auto &ac_records : collected_aicore_records_) {
-            if (!ac_records.empty()) {
-                has_any_records = true;
-                break;
-            }
-        }
-    }
-    if (!has_any_records) {
-        LOG_WARN("No performance data to export.");
-        return -1;
-    }
-
-    std::error_code ec;
-    std::filesystem::create_directories(output_prefix_, ec);
-    if (ec) {
-        LOG_ERROR("Error: Failed to create output directory %s: %s", output_prefix_.c_str(), ec.message().c_str());
-        return -1;
-    }
-
-    std::string filepath = output_prefix_ + "/l2_swimlane_records.json";
-    std::ofstream outfile(filepath);
-    if (!outfile.is_open()) {
-        LOG_ERROR("Error: Failed to open file: %s", filepath.c_str());
-        return -1;
-    }
-
-    int l2_swimlane_level = static_cast<int>(l2_swimlane_level_);
-
-    outfile << "{\n";
-    outfile << "  \"l2_swimlane_level\": " << l2_swimlane_level << ",\n";
-
-    // metadata: everything python needs that isn't in a per-record stream.
-    // clock_freq_hz drives the cycles→µs conversion (a2a3 = 50 MHz, a5 =
-    // 1 GHz — must come from the host, not be hardcoded in python).
-    outfile << "  \"metadata\": {\n";
-    outfile << "    \"clock_freq_hz\": " << PLATFORM_PROF_SYS_CNT_FREQ << ",\n";
-    outfile << "    \"num_cores\": " << num_aicore_ << ",\n";
-    outfile << "    \"core_types\": [";
-    for (int i = 0; i < num_aicore_; i++) {
-        CoreType ct = (i < static_cast<int>(core_types_.size())) ? core_types_[i] : CoreType::AIV;
-        if (i > 0) outfile << ", ";
-        outfile << "\"" << ((ct == CoreType::AIC) ? "aic" : "aiv") << "\"";
-    }
-    outfile << "]";
-    if (!core_to_thread_.empty()) {
-        outfile << ",\n    \"core_to_thread\": [";
-        for (size_t i = 0; i < core_to_thread_.size(); i++) {
-            if (i > 0) outfile << ", ";
-            outfile << static_cast<int>(core_to_thread_[i]);
-        }
-        outfile << "]";
-    }
-    outfile << "\n  },\n";
-
-    // Per-stream raw records. Flat array of tuples — compact at scale (a real
-    // PA trace has ~100K records, and per-field JSON keys would dominate the
-    // file size). Column order is documented in the schema comment at the top
-    // of swimlane_converter.py's v2 reader.
-    //
-    //   aicore_tasks: [core_id, task_token_raw, reg_task_id, start_cycles, end_cycles, receive_to_start_cycles]
-    //   aicpu_tasks:  [core_id, reg_task_id, dispatch_cycles, finish_cycles]
-    {
-        // copy_aicore_buffer already drops r.start_time == 0 slots when
-        // collecting from the device side, so no defensive filter here.
-        outfile << "  \"aicore_tasks\": [";
-        bool first = true;
-        size_t total = 0;
-        for (size_t core_idx = 0; core_idx < collected_aicore_records_.size(); core_idx++) {
-            for (const auto &r : collected_aicore_records_[core_idx]) {
-                if (!first) outfile << ",";
-                outfile << "\n    [" << core_idx << ", " << r.task_token_raw << ", " << r.reg_task_id << ", "
-                        << r.start_time << ", " << r.end_time << ", " << r.receive_to_start_cycles << "]";
-                first = false;
-                total++;
-            }
-        }
-        if (!first) outfile << "\n  ";
-        outfile << "]";
-        LOG_INFO_V0("  aicore_tasks: %zu records", total);
-    }
-    {
-        outfile << ",\n  \"aicpu_tasks\": [";
-        bool first = true;
-        size_t total = 0;
-        for (size_t core_idx = 0; core_idx < collected_perf_records_.size(); core_idx++) {
-            for (const auto &r : collected_perf_records_[core_idx]) {
-                if (!first) outfile << ",";
-                outfile << "\n    [" << core_idx << ", " << r.reg_task_id << ", " << r.dispatch_time << ", "
-                        << r.finish_time << "]";
-                first = false;
-                total++;
-            }
-        }
-        if (!first) outfile << "\n  ";
-        outfile << "]";
-        LOG_INFO_V0("  aicpu_tasks: %zu records", total);
-    }
-
-    // Phase records keep their per-thread sub-array shape so the python
-    // consumer's existing iteration pattern (one thread per inner list) stays
-    // unchanged; only the field names move from *_us to *_cycles.
-    if (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) {
-        auto sched_phase_name = [](L2SwimlaneSchedPhaseKind kind) -> const char * {
-            switch (kind) {
-            case L2SwimlaneSchedPhaseKind::Complete:
-                return "complete";
-            case L2SwimlaneSchedPhaseKind::Dispatch:
-                return "dispatch";
-            case L2SwimlaneSchedPhaseKind::Release:
-                return "release";
-            case L2SwimlaneSchedPhaseKind::Wire:
-                return "wire";
-            case L2SwimlaneSchedPhaseKind::Dummy:
-                return "dummy";
-            case L2SwimlaneSchedPhaseKind::EarlyDispatch:
-                return "early_dispatch";
-            case L2SwimlaneSchedPhaseKind::Resolve:
-                return "resolve";
-            case L2SwimlaneSchedPhaseKind::DummyTask:
-                return "dummy_task";
-            }
-            return "unknown";
-        };
-
-        auto emit_depth_array = [&outfile](const char *key, const int16_t arr[L2SWIMLANE_NUM_QUEUE_SHAPES]) {
-            outfile << ", \"" << key << "\": [" << arr[0] << "," << arr[1] << "," << arr[2] << "]";
-        };
-        outfile << ",\n  \"aicpu_scheduler_phases\": [\n";
-        for (size_t t = 0; t < collected_sched_phase_records_.size(); t++) {
-            outfile << "    [";
-            bool first = true;
-            for (const auto &pr : collected_sched_phase_records_[t]) {
-                if (!first) outfile << ",";
-                outfile << "\n      {\"kind\": \"" << sched_phase_name(pr.kind) << "\""
-                        << ", \"start_cycles\": " << pr.start_time << ", \"end_cycles\": " << pr.end_time
-                        << ", \"loop_iter\": " << pr.loop_iter << ", \"tasks_processed\": " << pr.tasks_processed;
-                if (pr.kind == L2SwimlaneSchedPhaseKind::Dispatch) {
-                    outfile << ", \"pop_hit\": " << pr.pop_hit << ", \"pop_miss\": " << pr.pop_miss;
-                }
-                // Queue-depth snapshots — [AIC, AIV, MIX] per L2SwimlaneAicpuSchedPhaseRecord docstring.
-                emit_depth_array("shared_at_start", pr.shared_depth_at_start);
-                emit_depth_array("shared_at_end", pr.shared_depth_at_end);
-                outfile << "}";
-                first = false;
-            }
-            if (!first) outfile << "\n    ";
-            outfile << "]";
-            if (t < collected_sched_phase_records_.size() - 1) outfile << ",";
-            outfile << "\n";
-        }
-        outfile << "  ]";
-
-        bool has_orch_phases = false;
-        if (l2_swimlane_level_ >= L2SwimlaneLevel::ORCH_PHASES) {
-            for (const auto &v : collected_orch_phase_records_) {
-                if (!v.empty()) {
-                    has_orch_phases = true;
-                    break;
-                }
-            }
-        }
-        if (has_orch_phases) {
-            size_t orch_lanes = static_cast<size_t>(get_l2_swimlane_header(shm_host_)->num_orch_phase_threads);
-            if (orch_lanes == 0 || orch_lanes > collected_orch_phase_records_.size()) {
-                orch_lanes = collected_orch_phase_records_.size();
-            }
-            outfile << ",\n  \"aicpu_orchestrator_phases\": [\n";
-            for (size_t t = 0; t < orch_lanes; t++) {
-                outfile << "    [";
-                bool first = true;
-                for (const auto &pr : collected_orch_phase_records_[t]) {
-                    if (!first) outfile << ",";
-                    outfile << "\n      {\"submit_idx\": " << pr.submit_idx << ", \"task_id\": " << pr.task_id
-                            << ", \"start_cycles\": " << pr.start_time << ", \"end_cycles\": " << pr.end_time << "}";
-                    first = false;
-                }
-                if (!first) outfile << "\n    ";
-                outfile << "]";
-                if (t < orch_lanes - 1) outfile << ",";
-                outfile << "\n";
-            }
-            outfile << "  ]";
-        }
-    }
-
-    outfile << "\n}\n";
-    outfile.close();
-
-    if (!outfile) {
-        LOG_ERROR("Failed to write JSON file (stream error): %s", filepath.c_str());
-        return -1;
-    }
-
-    LOG_INFO_V0("=== JSON Export Complete ===");
-    LOG_INFO_V0("File: %s", filepath.c_str());
-
-    return 0;
-}
-
-int L2SwimlaneCollector::finalize(L2SwimlaneUnregisterCallback unregister_cb, const L2SwimlaneFreeCallback &free_cb) {
-    if (shm_host_ == nullptr) {
-        return 0;
-    }
-
-    // Stop mgmt + collector threads if the caller didn't already (idempotent).
-    stop();
-
-    LOG_DEBUG("Cleaning up performance profiling resources");
-
-    // Every release site below goes through release_one_buffer so the
-    // unregister and free are an inseparable pair — each dev_ptr that
-    // alloc_single_buffer installed via halHostRegister is unregistered
-    // before its device memory is freed. Without this the Ascend HAL's
-    // per-device registration table accumulates leaked entries across
-    // init_l2_swimlane() invocations and back-to-back l2_swimlane tests on
-    // a reused Worker fail at rc=8 from halHostRegister.
-
-    // Free the standalone l2_swimlane_aicore_rotation_table.
-    release_one_buffer(aicore_ring_addr_table_dev_, unregister_cb, free_cb);
-    aicore_ring_addr_table_dev_ = nullptr;
-
-    // Release framework-owned buffers (recycled pools, done_queue, ready_queue).
-    manager_.release_owned_buffers([this, unregister_cb, free_cb](void *p) {
-        release_one_buffer(p, unregister_cb, free_cb);
-    });
-
-    // Per-core: current buffer + free_queue slots — these were owned by
-    // the AICPU side, not the framework. Same drain pattern for both the
-    // L2SwimlaneAicpuTaskBuffer pool and the L2SwimlaneAicoreTaskBuffer pool.
-    auto drain_free_queue = [&](L2SwimlaneFreeQueue &fq) {
-        rmb();
-        uint32_t head = fq.head;
-        uint32_t tail = fq.tail;
-        uint32_t queued = tail - head;
-        if (queued > PLATFORM_PROF_SLOT_COUNT) {
-            queued = PLATFORM_PROF_SLOT_COUNT;
-        }
-        for (uint32_t k = 0; k < queued; k++) {
-            uint32_t slot = (head + k) % PLATFORM_PROF_SLOT_COUNT;
-            release_one_buffer(reinterpret_cast<void *>(fq.buffer_ptrs[slot]), unregister_cb, free_cb);
-            fq.buffer_ptrs[slot] = 0;
-        }
-        fq.head = tail;
-    };
-
-    for (int i = 0; i < num_aicore_; i++) {
-        L2SwimlaneAicpuTaskPool *state = get_perf_buffer_state(shm_host_, i);
-        release_one_buffer(reinterpret_cast<void *>(state->head.current_buf_ptr), unregister_cb, free_cb);
-        state->head.current_buf_ptr = 0;
-        drain_free_queue(state->free_queue);
-
-        L2SwimlaneAicoreTaskPool *ac_state = get_aicore_buffer_state(shm_host_, num_aicore_, i);
-        release_one_buffer(reinterpret_cast<void *>(ac_state->head.current_buf_ptr), unregister_cb, free_cb);
-        ac_state->head.current_buf_ptr = 0;
-        drain_free_queue(ac_state->free_queue);
-    }
-
-    auto release_phase_pool = [&](L2SwimlaneAicpuTaskPool *state) {
-        release_one_buffer(reinterpret_cast<void *>(state->head.current_buf_ptr), unregister_cb, free_cb);
-        state->head.current_buf_ptr = 0;
-
-        rmb();
-        uint32_t head = state->free_queue.head;
-        uint32_t tail = state->free_queue.tail;
-        uint32_t queued = tail - head;
-        if (queued > PLATFORM_PROF_SLOT_COUNT) {
-            queued = PLATFORM_PROF_SLOT_COUNT;
-        }
-        for (uint32_t k = 0; k < queued; k++) {
-            uint32_t slot = (head + k) % PLATFORM_PROF_SLOT_COUNT;
-            release_one_buffer(reinterpret_cast<void *>(state->free_queue.buffer_ptrs[slot]), unregister_cb, free_cb);
-            state->free_queue.buffer_ptrs[slot] = 0;
-        }
-        state->free_queue.head = tail;
-    };
-    int num_phase_threads = PLATFORM_MAX_AICPU_THREADS;
-    for (int t = 0; t < num_phase_threads; t++) {
-        release_phase_pool(get_sched_phase_buffer_state(shm_host_, num_aicore_, t));
-    }
-    for (int t = 0; t < num_phase_threads; t++) {
-        release_phase_pool(get_orch_phase_buffer_state(shm_host_, num_aicore_, t));
-    }
-
-    // Main shm: unregister + free as a pair, same as every other buffer.
-    // ProfilerBase's set_memory_context handed register_cb == nullptr iff the
-    // caller doesn't intend to register, so checking unregister_cb inside
-    // release_one_buffer is sufficient — no separate ``was_registered_`` flag.
-    release_one_buffer(perf_shared_mem_dev_, unregister_cb, free_cb);
-    LOG_DEBUG("Main shm released");
-
-    perf_shared_mem_dev_ = nullptr;
-    // shm_host_ aliases freed device/host memory now; null it so is_initialized()
-    // reports false, the dtor's "destroyed without finalize()" warning stays
-    // quiet, and a re-entrant finalize() / re-init hits the early-out instead of
-    // walking freed buffer state. Mirrors PMU/DepGen/TensorDump collectors.
-    shm_host_ = nullptr;
-    collected_perf_records_.clear();
-    collected_sched_phase_records_.clear();
-    collected_orch_phase_records_.clear();
-    core_to_thread_.clear();
-    has_phase_data_.store(false, std::memory_order_relaxed);
-    total_perf_collected_.store(0, std::memory_order_relaxed);
-    total_sched_phase_collected_.store(0, std::memory_order_relaxed);
-    total_orch_phase_collected_.store(0, std::memory_order_relaxed);
-    clear_memory_context();
-
-    LOG_DEBUG("Performance profiling cleanup complete");
-    return 0;
-}
diff --git a/src/a2a3/platform/sim/host/CMakeLists.txt b/src/a2a3/platform/sim/host/CMakeLists.txt
index 3b9f283f0..40813f369 100644
--- a/src/a2a3/platform/sim/host/CMakeLists.txt
+++ b/src/a2a3/platform/sim/host/CMakeLists.txt
@@ -45,10 +45,10 @@ list(APPEND HOST_RUNTIME_SOURCES
     "${CMAKE_CURRENT_SOURCE_DIR}/profiling_copy.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/pto_runtime_c_api.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/platform_compile_info.cpp"
-    "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/l2_swimlane_collector.cpp"
+    "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/l2_swimlane_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/tensor_dump_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/pmu_collector.cpp"
-    "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/dep_gen_collector.cpp"
+    "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/dep_gen_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/scope_stats_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/sim/aicpu/platform_aicpu_affinity.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform_comm/comm_sim.cpp"
diff --git a/src/a5/platform/onboard/host/CMakeLists.txt b/src/a5/platform/onboard/host/CMakeLists.txt
index 7450a1eb9..177abdecf 100644
--- a/src/a5/platform/onboard/host/CMakeLists.txt
+++ b/src/a5/platform/onboard/host/CMakeLists.txt
@@ -43,9 +43,9 @@ list(APPEND HOST_RUNTIME_SOURCES
     "${CMAKE_CURRENT_SOURCE_DIR}/pto_runtime_c_api.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/platform_compile_info.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/host_regs.cpp"
-    "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/l2_swimlane_collector.cpp"
+    "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/l2_swimlane_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/pmu_collector.cpp"
-    "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/dep_gen_collector.cpp"
+    "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/dep_gen_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/scope_stats_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/tensor_dump_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/comm_hccl.cpp"
diff --git a/src/a5/platform/shared/aicpu/dep_gen_collector_aicpu.cpp b/src/a5/platform/shared/aicpu/dep_gen_collector_aicpu.cpp
deleted file mode 100644
index e462c7579..000000000
--- a/src/a5/platform/shared/aicpu/dep_gen_collector_aicpu.cpp
+++ /dev/null
@@ -1,432 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-
-/**
- * @file dep_gen_collector_aicpu.cpp
- * @brief AICPU-side dep_gen capture implementation
- *
- * Single-instance: dep_gen captures the orchestrator's submit_task stream,
- * so there is one BufferState and one current_buf — no per-core arrays.
- *
- * Buffer switching (SPSC):
- *   - Host pushes free DepGenBuffers via free_queue.
- *   - AICPU pops when current buffer fills; pushes full buffer to per-thread
- *     ready_queue (indexed by orch_thread_idx).
- *   - Full buffers are published before AICPU tries to recover a replacement.
- *     If recovery is delayed, later records are counted as dropped until host
- *     replenishes free_queue. Host reads dropped at finalize to decide whether
- *     to emit deps.json.
- */
-
-#include "aicpu/dep_gen_collector_aicpu.h"
-
-#include <cstring>
-
-#include "aicpu/device_time.h"
-#include "common/memory_barrier.h"
-#include "common/platform_config.h"
-#include "common/unified_log.h"
-
-static uint64_t g_platform_dep_gen_base = 0;
-static bool g_enable_dep_gen = false;
-
-// File-local cached state for the single dep_gen instance (the orchestrator).
-static DepGenDataHeader *s_dep_gen_header = nullptr;
-static DepGenBufferState *s_dep_gen_state = nullptr;
-static int s_orch_thread_idx = -1;  // set via dep_gen_aicpu_set_orch_thread_idx
-
-static constexpr uint64_t kDepGenQueueBackpressureWaitCycles = PLATFORM_PROF_SYS_CNT_FREQ / 50000;  // 20 us
-
-extern "C" void set_platform_dep_gen_base(uint64_t dep_gen_data_base) { g_platform_dep_gen_base = dep_gen_data_base; }
-
-extern "C" uint64_t get_platform_dep_gen_base() { return g_platform_dep_gen_base; }
-
-extern "C" void set_dep_gen_enabled(bool enable) { g_enable_dep_gen = enable; }
-
-extern "C" bool is_dep_gen_enabled() { return g_enable_dep_gen; }
-
-void dep_gen_aicpu_set_orch_thread_idx(int thread_idx) { s_orch_thread_idx = thread_idx; }
-
-// ---------------------------------------------------------------------------
-// Internal: enqueue full buffer to per-thread ready_queue
-// ---------------------------------------------------------------------------
-
-static bool
-wait_for_ready_queue_space(DepGenDataHeader *header, int thread_idx, uint32_t *tail_out, uint32_t *head_out) {
-    if (header == nullptr || thread_idx < 0 || thread_idx >= PLATFORM_MAX_AICPU_THREADS) {
-        return false;
-    }
-    const uint32_t capacity = PLATFORM_DEP_GEN_READYQUEUE_SIZE;
-    const uint64_t start = get_sys_cnt_aicpu();
-
-    do {
-        uint32_t current_tail = header->queue_tails[thread_idx];
-        uint32_t current_head = header->queue_heads[thread_idx];
-        uint32_t next_tail = (current_tail + 1) % capacity;
-        if (next_tail != current_head) {
-            *tail_out = current_tail;
-            *head_out = current_head;
-            return true;
-        }
-        if (get_sys_cnt_aicpu() - start >= kDepGenQueueBackpressureWaitCycles) {
-            break;
-        }
-    } while (true);
-    return false;
-}
-
-static bool wait_for_free_queue_entry(DepGenFreeQueue *free_queue, uint32_t *head_out, uint32_t *tail_out) {
-    if (free_queue == nullptr) {
-        return false;
-    }
-    const uint64_t start = get_sys_cnt_aicpu();
-
-    do {
-        uint32_t head = free_queue->head;
-        uint32_t tail = free_queue->tail;
-        if (head != tail) {
-            *head_out = head;
-            *tail_out = tail;
-            rmb();  // acquire: order the tail read above before the caller's buffer_ptrs read
-            return true;
-        }
-        if (get_sys_cnt_aicpu() - start >= kDepGenQueueBackpressureWaitCycles) {
-            break;
-        }
-    } while (true);
-    return false;
-}
-
-static int enqueue_dep_gen_ready_buffer(uint64_t buffer_ptr, uint32_t buffer_seq) {
-    int q = s_orch_thread_idx;
-    uint32_t capacity = PLATFORM_DEP_GEN_READYQUEUE_SIZE;
-    uint32_t current_tail = 0;
-    uint32_t current_head = 0;
-    if (!wait_for_ready_queue_space(s_dep_gen_header, q, &current_tail, &current_head)) {
-        return -1;
-    }
-
-    uint32_t next_tail = (current_tail + 1) % capacity;
-    s_dep_gen_header->queues[q][current_tail].instance_index = 0;
-    s_dep_gen_header->queues[q][current_tail].buffer_ptr = buffer_ptr;
-    s_dep_gen_header->queues[q][current_tail].buffer_seq = buffer_seq;
-    wmb();  // publish: entry fields visible before the tail advance
-    s_dep_gen_header->queue_tails[q] = next_tail;
-    return 0;
-}
-
-static DepGenBuffer *try_pop_dep_gen_buffer(uint32_t next_seq) {
-    if (s_dep_gen_state == nullptr) {
-        return nullptr;
-    }
-    uint32_t head = 0;
-    uint32_t tail = 0;
-    if (!wait_for_free_queue_entry(&s_dep_gen_state->free_queue, &head, &tail)) {
-        return nullptr;
-    }
-
-    uint64_t new_buf_ptr = s_dep_gen_state->free_queue.buffer_ptrs[head % PLATFORM_DEP_GEN_SLOT_COUNT];
-    s_dep_gen_state->free_queue.head = head + 1;
-    if (new_buf_ptr == 0) {
-        return nullptr;
-    }
-
-    DepGenBuffer *new_buf = reinterpret_cast<DepGenBuffer *>(new_buf_ptr);
-    new_buf->count = 0;
-    s_dep_gen_state->current_buf_ptr = new_buf_ptr;
-    s_dep_gen_state->current_buf_seq = next_seq;
-    wmb();
-    return new_buf;
-}
-
-// ---------------------------------------------------------------------------
-// Internal: switch the current buffer
-// ---------------------------------------------------------------------------
-
-static void dep_gen_switch_buffer() {
-    if (s_dep_gen_state == nullptr) {
-        return;
-    }
-    DepGenBuffer *full_buf = reinterpret_cast<DepGenBuffer *>(s_dep_gen_state->current_buf_ptr);
-    if (full_buf == nullptr) {
-        return;
-    }
-
-    uint32_t seq = s_dep_gen_state->current_buf_seq;
-    int rc = enqueue_dep_gen_ready_buffer(s_dep_gen_state->current_buf_ptr, seq);
-    if (rc != 0) {
-        LOG_ERROR("dep_gen: failed to enqueue full buffer (ready_queue full), %u records dropped", full_buf->count);
-        s_dep_gen_state->dropped_record_count += full_buf->count;
-        full_buf->count = 0;
-        wmb();
-        return;
-    }
-
-    uint32_t next_seq = seq + 1;
-    s_dep_gen_state->current_buf_ptr = 0;
-    s_dep_gen_state->current_buf_seq = next_seq;
-    wmb();
-
-    (void)try_pop_dep_gen_buffer(next_seq);
-}
-
-// ---------------------------------------------------------------------------
-// Public interface
-// ---------------------------------------------------------------------------
-
-void dep_gen_aicpu_init() {
-    void *base = reinterpret_cast<void *>(get_platform_dep_gen_base());
-    if (base == nullptr) {
-        LOG_ERROR("dep_gen_aicpu_init: dep_gen_data_base is NULL");
-        return;
-    }
-    s_dep_gen_header = get_dep_gen_header(base);
-    s_dep_gen_state = get_dep_gen_buffer_state(base, /*instance_index=*/0);
-
-    rmb();
-    uint32_t head = s_dep_gen_state->free_queue.head;
-    uint32_t tail = s_dep_gen_state->free_queue.tail;
-
-    if (head != tail) {
-        (void)try_pop_dep_gen_buffer(0);
-        uint64_t buf_ptr = s_dep_gen_state->current_buf_ptr;
-        LOG_INFO_V0("dep_gen: popped initial buffer addr=0x%lx", buf_ptr);
-    } else {
-        LOG_ERROR("dep_gen: free_queue empty during init");
-        s_dep_gen_state->current_buf_ptr = 0;
-    }
-    wmb();
-}
-
-void dep_gen_aicpu_record_submit(
-    uint64_t task_id_raw, bool in_manual_scope, int tensor_count, const void *const *tensor_ptrs,
-    const uint8_t *arg_types, int explicit_dep_count, const uint64_t *explicit_deps_raw, int block_num,
-    const int32_t kernel_ids[3]
-) {
-    if (!g_enable_dep_gen || s_dep_gen_state == nullptr) {
-        return;
-    }
-
-    // Account every attempted record so collected + dropped == total + total_overflow on host.
-    s_dep_gen_state->total_record_count += 1;
-
-    int dc = explicit_dep_count;
-    if (dc < 0) dc = 0;
-    if (dc > 0 && explicit_deps_raw == nullptr) dc = 0;
-    int needed = dep_gen_records_needed_for(dc);
-
-    rmb();
-    uint64_t cur_ptr = s_dep_gen_state->current_buf_ptr;
-    if (cur_ptr == 0) {
-        DepGenBuffer *recovered = try_pop_dep_gen_buffer(s_dep_gen_state->current_buf_seq);
-        if (recovered == nullptr) {
-            s_dep_gen_state->dropped_record_count += 1;
-            wmb();
-            return;
-        }
-        cur_ptr = s_dep_gen_state->current_buf_ptr;
-    }
-    DepGenBuffer *buf = reinterpret_cast<DepGenBuffer *>(cur_ptr);
-
-    // Snapshot the count from volatile shared memory into a local so capacity
-    // math, base-record idx, and the final publish all use the same value.
-    // Single-writer ownership means a re-read would return the same value
-    // today, but a local snapshot makes the invariant explicit and is also
-    // a guardrail if a future device-side actor ever races count.
-    uint32_t local_count = buf->count;
-
-    // Reserve the whole chain up front. If it won't fit in the current
-    // buffer, switch first (skipping the switch when the current buffer is
-    // already empty — switching would just enqueue a zero-record buffer and
-    // pop a fresh one we'd truncate into anyway). Then, regardless of whether
-    // we switched, if the chain still won't fit (chain larger than the
-    // buffer), cap dc to what the buffer can hold and log truncation.
-    if (local_count > 0 &&
-        local_count + static_cast<uint32_t>(needed) > static_cast<uint32_t>(PLATFORM_DEP_GEN_RECORDS_PER_BUFFER)) {
-        dep_gen_switch_buffer();
-        rmb();
-        cur_ptr = s_dep_gen_state->current_buf_ptr;
-        if (cur_ptr == 0) {
-            DepGenBuffer *recovered = try_pop_dep_gen_buffer(s_dep_gen_state->current_buf_seq);
-            if (recovered == nullptr) {
-                s_dep_gen_state->dropped_record_count += 1;
-                wmb();
-                return;
-            }
-            cur_ptr = s_dep_gen_state->current_buf_ptr;
-        }
-        buf = reinterpret_cast<DepGenBuffer *>(cur_ptr);
-        local_count = buf->count;  // refresh after switch — new buffer starts at 0
-    }
-
-    const int capacity = PLATFORM_DEP_GEN_RECORDS_PER_BUFFER - static_cast<int>(local_count);
-    if (capacity <= 0) {
-        // local_count is bounded by the previous writer's publish step, so
-        // this is only reachable if shared memory was corrupted out from
-        // under us. Drop the record and bail rather than write past the end
-        // of buf->records[].
-        LOG_ERROR("dep_gen: invalid capacity %d (local_count=%u), dropping record", capacity, local_count);
-        s_dep_gen_state->dropped_record_count += 1;
-        wmb();
-        return;
-    }
-    if (needed > capacity) {
-        // Compute the largest dc that fits in `capacity` slots.
-        int dc_fit = DEP_GEN_MAX_EXPLICIT_DEPS + (capacity - 1) * DEP_GEN_OVERFLOW_DEPS_PER_RECORD;
-        LOG_ERROR(
-            "dep_gen: chain (%d records for %d deps) exceeds buffer capacity (%d slots), truncating to %d deps", needed,
-            dc, capacity, dc_fit
-        );
-        dc = dc_fit;
-        needed = dep_gen_records_needed_for(dc);
-    }
-
-    int tc = tensor_count;
-    if (tc < 0) {
-        tc = 0;
-    } else if (tc > CORE_MAX_TENSOR_ARGS) {
-        // The runtime's Arg also caps at CORE_MAX_TENSOR_ARGS, so this should
-        // never trip; clamp defensively to keep the writer crash-free.
-        LOG_ERROR("dep_gen: tensor_count %d > CORE_MAX_TENSOR_ARGS (%d), truncating", tc, CORE_MAX_TENSOR_ARGS);
-        tc = CORE_MAX_TENSOR_ARGS;
-    }
-
-    // ---- Write base record ----
-    uint32_t idx = local_count;
-    DepGenRecord *rec = &buf->records[idx];
-
-    rec->task_id = task_id_raw;
-    // Cast the enum to uint32_t before the ternary so Linux GCC's -Wextra
-    // does not warn about "enumerated and non-enumerated type in conditional".
-    uint32_t base_flags = in_manual_scope ? static_cast<uint32_t>(DEP_GEN_FLAG_IN_MANUAL_SCOPE) : 0u;
-    if (needed > 1) {
-        base_flags |= static_cast<uint32_t>(DEP_GEN_FLAG_HAS_OVERFLOW);
-    }
-    rec->flags = base_flags;
-    rec->tensor_count = static_cast<uint16_t>(tc);
-    rec->block_num = block_num > 0 ? static_cast<uint32_t>(block_num) : 1u;
-
-    int base_dc = (dc < DEP_GEN_MAX_EXPLICIT_DEPS) ? dc : DEP_GEN_MAX_EXPLICIT_DEPS;
-    rec->explicit_dep_count = static_cast<uint16_t>(base_dc);
-
-    // explicit_deps (tail of the entry, packed; replay reads only the first base_dc entries)
-    if (base_dc > 0) {
-        memcpy(rec->explicit_deps, explicit_deps_raw, static_cast<size_t>(base_dc) * sizeof(uint64_t));
-    }
-
-    // arg_types
-    if (tc > 0 && arg_types != nullptr) {
-        memcpy(rec->arg_types, arg_types, static_cast<size_t>(tc));
-    }
-
-    // Per-subslot kernel ids (AIC, AIV0, AIV1). The orchestrator owns the
-    // identity-side of the swimlane join: with task_id (PTO2 raw) + kernel_id
-    // captured here, the host post-processor can name every AICore record.
-    // Inactive subslots stay at INVALID_KERNEL_ID (-1); the caller is expected
-    // to pass that sentinel rather than 0.
-    if (kernel_ids != nullptr) {
-        rec->kernel_id[0] = kernel_ids[0];
-        rec->kernel_id[1] = kernel_ids[1];
-        rec->kernel_id[2] = kernel_ids[2];
-    } else {
-        rec->kernel_id[0] = -1;
-        rec->kernel_id[1] = -1;
-        rec->kernel_id[2] = -1;
-    }
-
-    // tensors[]: per-slot 128-byte blob (or zero if pointer is null — OUTPUT slot)
-    if (tc > 0) {
-        if (tensor_ptrs == nullptr) {
-            memset(rec->tensors, 0, static_cast<size_t>(tc) * DEP_GEN_TENSOR_SIZE);
-        } else {
-            for (int i = 0; i < tc; i++) {
-                if (tensor_ptrs[i] == nullptr) {
-                    memset(rec->tensors[i], 0, DEP_GEN_TENSOR_SIZE);
-                } else {
-                    memcpy(rec->tensors[i], tensor_ptrs[i], DEP_GEN_TENSOR_SIZE);
-                }
-            }
-        }
-    }
-
-    // ---- Write overflow chain ----
-    // Charge each overflow slot to total_overflow_record_count so the host's
-    // reconciliation equation (`collected + dropped == total + total_overflow`)
-    // accounts for chain expansion. total_record_count stays "one per submit"
-    // — see DepGenBufferState doc.
-    if (needed > 1) {
-        s_dep_gen_state->total_overflow_record_count += static_cast<uint32_t>(needed - 1);
-    }
-    int written = base_dc;
-    for (int slot = 1; slot < needed; slot++) {
-        auto *over = reinterpret_cast<DepGenOverflowRecord *>(&buf->records[idx + static_cast<uint32_t>(slot)]);
-        over->task_id = task_id_raw;
-        const int chunk =
-            ((dc - written) < DEP_GEN_OVERFLOW_DEPS_PER_RECORD) ? (dc - written) : DEP_GEN_OVERFLOW_DEPS_PER_RECORD;
-        const bool is_last = (slot == needed - 1);
-        uint32_t over_flags = static_cast<uint32_t>(DEP_GEN_FLAG_OVERFLOW);
-        if (is_last) {
-            over_flags |= static_cast<uint32_t>(DEP_GEN_FLAG_LAST_OVERFLOW);
-        }
-        over->flags = over_flags;
-        over->dep_count = static_cast<uint16_t>(chunk);
-        over->_reserved = 0;
-        if (chunk > 0) {
-            memcpy(over->deps, explicit_deps_raw + written, static_cast<size_t>(chunk) * sizeof(uint64_t));
-        }
-        written += chunk;
-    }
-
-    // Publish all reserved slots atomically — host either sees the old count
-    // (chain invisible) or the new count with the full chain committed. The
-    // single trailing wmb() flushes both the record payloads and the count
-    // store, matching the pre-chain contract.
-    buf->count = idx + static_cast<uint32_t>(needed);
-    wmb();
-}
-
-void dep_gen_aicpu_flush() {
-    if (s_dep_gen_header == nullptr || s_dep_gen_state == nullptr) {
-        return;
-    }
-
-    rmb();
-    uint64_t buf_ptr = s_dep_gen_state->current_buf_ptr;
-    if (buf_ptr == 0) {
-        return;
-    }
-    DepGenBuffer *buf = reinterpret_cast<DepGenBuffer *>(buf_ptr);
-    if (buf->count == 0) {
-        return;
-    }
-
-    uint32_t seq = s_dep_gen_state->current_buf_seq;
-    int rc = enqueue_dep_gen_ready_buffer(buf_ptr, seq);
-    if (rc == 0) {
-        LOG_INFO_V0("dep_gen: flushed buffer with %u records", buf->count);
-        s_dep_gen_state->current_buf_ptr = 0;
-        wmb();
-    } else {
-        LOG_ERROR("dep_gen: flush failed (ready_queue full), %u records dropped", buf->count);
-        s_dep_gen_state->dropped_record_count += buf->count;
-        buf->count = 0;
-        s_dep_gen_state->current_buf_ptr = 0;
-        wmb();
-    }
-}
-
-void dep_gen_aicpu_finalize() {
-    // No HW state to restore (unlike PMU). Reset file-local cache for cleanliness
-    // — the next init re-resolves these from the (potentially new) base anyway.
-    s_dep_gen_header = nullptr;
-    s_dep_gen_state = nullptr;
-    s_orch_thread_idx = -1;
-}
diff --git a/src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp b/src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
deleted file mode 100644
index 260374832..000000000
--- a/src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
+++ /dev/null
@@ -1,1016 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-
-/**
- * @file l2_swimlane_collector_aicpu.cpp
- * @brief AICPU performance data collection implementation (SPSC free queue)
- *
- * Uses per-core L2SwimlaneAicpuTaskPool with SPSC free queues for O(1) buffer switching.
- * Host memory manager dynamically allocates replacement buffers and pushes
- * them into the free_queue. Device pops from free_queue when switching.
- */
-
-#include "aicpu/l2_swimlane_collector_aicpu.h"
-
-#include <cinttypes>
-#include <cstring>
-
-#include "aicpu/platform_regs.h"
-#include "common/memory_barrier.h"
-#include "common/platform_config.h"
-#include "common/unified_log.h"
-
-// Cached pointers for hot-path access (set during init). Phase metadata
-// (num_sched_phase_threads, num_orch_phase_threads, num_phase_cores,
-// core_to_thread[]) lives inside L2SwimlaneDataHeader after the phase-header
-// merge; we keep a separate bool so phase-gated paths can check init-ran
-// without re-reading the device-shared header.
-static L2SwimlaneDataHeader *s_l2_swimlane_header = nullptr;
-static bool s_phase_initialized = false;
-
-// Per-core L2SwimlaneAicpuTaskPool cache
-static L2SwimlaneAicpuTaskPool *s_aicpu_task_pools[PLATFORM_MAX_CORES] = {};
-
-// Per-core L2SwimlaneAicoreTaskPool cache (lives in the same shared region;
-// host writes initial pool + the rotation channel that AICore polls).
-//
-// All AICore-side bookkeeping (rotation channel, free queue,
-// total_record_count, current_buf_seq) is owned by this shared struct — see
-// l2_swimlane_profiling.h. We deliberately do not keep AICPU-process-local
-// mirror counters because the struct's volatile fields are the single
-// source of truth across init/complete/rotate/flush. The high-water-mark
-// formula `total_record_count - current_buf_seq * BUFFER_SIZE` correctly
-// handles the failed-rotation case (free_queue empty or ready_queue full)
-// since current_buf_seq only bumps on a successful rotation.
-static L2SwimlaneAicoreTaskPool *s_aicore_task_pools[PLATFORM_MAX_CORES] = {};
-
-// Per-core AICPU-side dispatch count. Incremented on every
-// `l2_swimlane_aicpu_on_aicore_dispatch` call (= once per AICore dispatch).
-// When the pre-bump value is a non-zero multiple of PLATFORM_AICORE_BUFFER_SIZE,
-// AICPU rotates the AICore buffer before the upcoming write_reg(DATA_MAIN_BASE).
-// Single-writer per cell (the scheduler thread that owns the core).
-static uint32_t s_aicore_dispatched_count[PLATFORM_MAX_CORES] = {};
-
-// Per-core cached current-records-buffer pointer. Written by AICPU when
-// rotating buffers from inside `l2_swimlane_aicpu_complete_task`. AICore writes
-// to its own per-core L2SwimlaneAicoreTaskBuffer (host-allocated, AICPU rotates)
-// and AICPU never reads from it on the hot path.
-static L2SwimlaneAicpuTaskBuffer *s_current_aicpu_task_buffers[PLATFORM_MAX_CORES] = {};
-
-// Per-thread sched-phase pool/buffer caches (per-scheduler-thread)
-static L2SwimlaneAicpuSchedPhasePool *s_sched_phase_pools[PLATFORM_MAX_AICPU_THREADS] = {};
-static L2SwimlaneAicpuSchedPhaseBuffer *s_current_sched_phase_buffers[PLATFORM_MAX_AICPU_THREADS] = {};
-
-// Per-thread orch-phase pool/buffer caches (one orch thread).
-static L2SwimlaneAicpuOrchPhasePool *s_orch_phase_pools[PLATFORM_MAX_AICPU_THREADS] = {};
-static L2SwimlaneAicpuOrchPhaseBuffer *s_current_orch_phase_buffers[PLATFORM_MAX_AICPU_THREADS] = {};
-
-static int s_orch_thread_idx = -1;
-
-// L2 swimlane platform state. Published by the host (via dlsym'd setters on sim)
-// or by the AICPU kernel entry (onboard) before perf init runs, so downstream
-// perf code can discover enablement + device-base without reading the generic
-// Runtime struct. Two channels (mirrors PMU):
-//   - g_enable_l2_swimlane (bool) — set at kernel entry from the bitmask bit
-//   - g_l2_swimlane_level (L2SwimlaneLevel) — promoted in
-//     l2_swimlane_aicpu_init from the shared-memory header so
-//     `>= AICPU_TIMING / SCHED_PHASES / ORCH_PHASES` gates have the granular
-//     value (exposed via get_l2_swimlane_level()).
-static uint64_t g_platform_l2_swimlane_base = 0;
-static bool g_enable_l2_swimlane = false;
-static L2SwimlaneLevel g_l2_swimlane_level = L2SwimlaneLevel::DISABLED;
-
-// AICore rotation-table device pointer (= KernelArgs::l2_swimlane_aicore_rotation_table).
-// Published by the host (sim: dlsym'd setter; onboard: from k_args via the
-// kernel entry); AICPU init walks it to fill per-core &rotation addresses.
-static uint64_t g_platform_l2_swimlane_aicore_rotation_table = 0;
-
-extern "C" void set_platform_l2_swimlane_base(uint64_t l2_swimlane_data_base) {
-    g_platform_l2_swimlane_base = l2_swimlane_data_base;
-}
-extern "C" uint64_t get_platform_l2_swimlane_base() { return g_platform_l2_swimlane_base; }
-extern "C" void set_l2_swimlane_enabled(bool enable) { g_enable_l2_swimlane = enable; }
-extern "C" bool is_l2_swimlane_enabled() { return g_enable_l2_swimlane; }
-extern "C" void set_platform_l2_swimlane_aicore_rotation_table(uint64_t table_addr) {
-    g_platform_l2_swimlane_aicore_rotation_table = table_addr;
-}
-extern "C" uint64_t get_platform_l2_swimlane_aicore_rotation_table() {
-    return g_platform_l2_swimlane_aicore_rotation_table;
-}
-L2SwimlaneLevel get_l2_swimlane_level() { return g_l2_swimlane_level; }
-
-static constexpr uint64_t kL2SwimlaneQueueBackpressureWaitCycles = PLATFORM_PROF_SYS_CNT_FREQ / 50000;  // 20 us
-
-static bool
-wait_for_ready_queue_space(L2SwimlaneDataHeader *header, int thread_idx, uint32_t *tail_out, uint32_t *head_out) {
-    if (header == nullptr || thread_idx < 0 || thread_idx >= PLATFORM_MAX_AICPU_THREADS) {
-        return false;
-    }
-    const uint32_t capacity = PLATFORM_PROF_READYQUEUE_SIZE;
-    const uint64_t start = get_sys_cnt_aicpu();
-
-    do {
-        uint32_t current_tail = header->queue_tails[thread_idx];
-        uint32_t current_head = header->queue_heads[thread_idx];
-        uint32_t next_tail = (current_tail + 1) % capacity;
-        if (next_tail != current_head) {
-            *tail_out = current_tail;
-            *head_out = current_head;
-            return true;
-        }
-        if (get_sys_cnt_aicpu() - start >= kL2SwimlaneQueueBackpressureWaitCycles) {
-            break;
-        }
-    } while (true);
-    return false;
-}
-
-static bool wait_for_free_queue_entry(L2SwimlaneFreeQueue *free_queue, uint32_t *head_out, uint32_t *tail_out) {
-    if (free_queue == nullptr) {
-        return false;
-    }
-    const uint64_t start = get_sys_cnt_aicpu();
-
-    do {
-        uint32_t head = free_queue->head;
-        uint32_t tail = free_queue->tail;
-        if (head != tail) {
-            *head_out = head;
-            *tail_out = tail;
-            rmb();  // acquire: order the tail read above before the caller's buffer_ptrs read
-            return true;
-        }
-        if (get_sys_cnt_aicpu() - start >= kL2SwimlaneQueueBackpressureWaitCycles) {
-            break;
-        }
-    } while (true);
-    return false;
-}
-
-/**
- * Enqueue ready buffer to per-thread queue
- *
- * @param header L2SwimlaneDataHeader pointer
- * @param thread_idx AICPU thread index (selects the per-thread ready queue)
- * @param core_index Core index for task entries, or pool ordinal for phase entries
- * @param buffer_ptr Device pointer to the full buffer
- * @param buffer_seq Sequence number for ordering
- * @param kind Buffer kind discriminator (see L2SwimlaneBufferKind)
- * @return 0 on success, -1 if queue full
- */
-static int enqueue_ready_buffer(
-    L2SwimlaneDataHeader *header, int thread_idx, uint32_t core_index, uint64_t buffer_ptr, uint32_t buffer_seq,
-    L2SwimlaneBufferKind kind
-) {
-    uint32_t capacity = PLATFORM_PROF_READYQUEUE_SIZE;
-    uint32_t current_tail = 0;
-    uint32_t current_head = 0;
-
-    if (!wait_for_ready_queue_space(header, thread_idx, &current_tail, &current_head)) {
-        return -1;
-    }
-    uint32_t next_tail = (current_tail + 1) % capacity;
-
-    header->queues[thread_idx][current_tail].core_index = core_index;
-    header->queues[thread_idx][current_tail].kind = kind;
-    header->queues[thread_idx][current_tail].buffer_ptr = buffer_ptr;
-    header->queues[thread_idx][current_tail].buffer_seq = buffer_seq;
-    wmb();  // publish: entry fields visible before the tail advance
-    header->queue_tails[thread_idx] = next_tail;
-
-    return 0;
-}
-
-static L2SwimlaneAicpuTaskBuffer *
-try_pop_records_buffer(int core_id, L2SwimlaneAicpuTaskPool *state, uint32_t next_seq) {
-    uint32_t head = 0;
-    uint32_t tail = 0;
-    if (!wait_for_free_queue_entry(&state->free_queue, &head, &tail)) {
-        return nullptr;
-    }
-
-    uint64_t new_buf_ptr = state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT];
-    rmb();
-    state->free_queue.head = head + 1;
-    if (new_buf_ptr == 0) {
-        return nullptr;
-    }
-
-    auto *new_buf = reinterpret_cast<L2SwimlaneAicpuTaskBuffer *>(new_buf_ptr);
-    new_buf->count = 0;
-    wmb();
-
-    state->head.current_buf_ptr = new_buf_ptr;
-    state->head.current_buf_seq = next_seq;
-    s_current_aicpu_task_buffers[core_id] = new_buf;
-    wmb();
-    return new_buf;
-}
-
-void l2_swimlane_aicpu_init(int worker_count) {
-    // Reset cross-launch state up front. AICPU statics persist across launches
-    // on the same loaded .so; without this reset, an enabled→disabled launch
-    // sequence would leave s_phase_initialized=true from the prior run, and
-    // any subsequent record_sched_phase / record_orch_phase call would
-    // dereference the prior launch's (now-freed) s_sched_phase_pools /
-    // s_orch_phase_pools pointers. Same shape as the [[block_local]] reset
-    // in onboard/aicore/kernel.cpp for the AICore-side rotation slot
-    // (fixed in #936).
-    s_phase_initialized = false;
-
-    // Reset AICore dispatch-count bookkeeping for the same reason: the next
-    // launch must start counting from 0 so the rotation boundary check
-    // (count % BUFFER_SIZE == 0) lands on the right dispatches. Stale values
-    // from a prior launch would skip the first rotation (count already past a
-    // boundary) or trigger one prematurely.
-    for (int i = 0; i < PLATFORM_MAX_CORES; i++) {
-        s_aicore_dispatched_count[i] = 0;
-    }
-
-    void *l2_swimlane_base = reinterpret_cast<void *>(g_platform_l2_swimlane_base);
-    if (l2_swimlane_base == nullptr) {
-        LOG_ERROR("l2_swimlane_data_base is NULL, cannot initialize profiling");
-        return;
-    }
-
-    s_l2_swimlane_header = get_l2_swimlane_header(l2_swimlane_base);
-
-    // Read the granular perf_level from the shared-memory header (host wrote
-    // it in L2SwimlaneCollector::initialize). The kernel-entry setter only seeded
-    // the binary g_enable_l2_swimlane via the bitmask bit.
-    g_l2_swimlane_level = static_cast<L2SwimlaneLevel>(s_l2_swimlane_header->l2_swimlane_level);
-
-    LOG_INFO_V0(
-        "Initializing performance profiling for %d cores (free queue), l2_swimlane_level=%u", worker_count,
-        static_cast<uint32_t>(g_l2_swimlane_level)
-    );
-
-    // Populate the per-core AICore head device-address table. AICore reads
-    // `l2_swimlane_aicore_rotation_table[block_idx]` from KernelArgs to find
-    // its `L2SwimlaneActiveHead` cache line; the table itself is
-    // host-allocated, but the entries are device-internal addresses
-    // (`&ac_state->head`) that the host would otherwise have to translate
-    // from host-mapped to device-mapped. AICPU already runs on the device,
-    // so it can write the addresses directly without any translation — that
-    // keeps the host side decoupled from the AICore shared-memory layout.
-    uint64_t *head_table = reinterpret_cast<uint64_t *>(g_platform_l2_swimlane_aicore_rotation_table);
-
-    // Pop first buffer from free_queue for each core
-    for (int i = 0; i < worker_count; i++) {
-        L2SwimlaneAicpuTaskPool *state = get_perf_buffer_state(l2_swimlane_base, i);
-        L2SwimlaneAicoreTaskPool *ac_state = get_aicore_buffer_state(l2_swimlane_base, worker_count, i);
-
-        s_aicpu_task_pools[i] = state;
-        s_aicore_task_pools[i] = ac_state;
-
-        if (head_table != nullptr) {
-            head_table[i] = reinterpret_cast<uint64_t>(&ac_state->head);
-        }
-
-        // Pop first buffer from free_queue
-        rmb();
-        uint32_t head = state->free_queue.head;
-        uint32_t tail = state->free_queue.tail;
-
-        if (head != tail) {
-            uint64_t buf_ptr = state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT];
-            rmb();
-            state->free_queue.head = head + 1;
-            state->head.current_buf_ptr = buf_ptr;
-            state->head.current_buf_seq = 0;
-            wmb();
-
-            L2SwimlaneAicpuTaskBuffer *buf = reinterpret_cast<L2SwimlaneAicpuTaskBuffer *>(buf_ptr);
-            buf->count = 0;
-            s_current_aicpu_task_buffers[i] = buf;
-
-            LOG_DEBUG("Core %d: popped initial buffer (addr=0x%lx)", i, buf_ptr);
-        } else {
-            LOG_ERROR("Core %d: free_queue is empty during init!", i);
-            state->head.current_buf_ptr = 0;
-            s_current_aicpu_task_buffers[i] = nullptr;
-        }
-
-        // Prime the AICore head channel with the initial buffer. Seq starts
-        // at 0; AICore's local `cached_buf_seq` defaults to UINT32_MAX so the
-        // first record_task call observes a mismatch and loads the buffer.
-        rmb();
-        uint32_t ac_head = ac_state->free_queue.head;
-        uint32_t ac_tail = ac_state->free_queue.tail;
-        if (ac_head != ac_tail) {
-            uint64_t ac_buf_ptr = ac_state->free_queue.buffer_ptrs[ac_head % PLATFORM_PROF_SLOT_COUNT];
-            rmb();
-            ac_state->free_queue.head = ac_head + 1;
-            // Same publish pattern as aicore_rotate: ptr first, then a fence,
-            // then seq. AICore lazy-resolves the head on its first task, so
-            // strict ordering here matters only if AICore is ever changed to
-            // start polling before the first dispatch — keeping the patterns
-            // aligned future-proofs that.
-            ac_state->head.current_buf_ptr = ac_buf_ptr;
-            wmb();
-            ac_state->head.current_buf_seq = 0;
-            wmb();
-            L2SwimlaneAicoreTaskBuffer *ac_buf = reinterpret_cast<L2SwimlaneAicoreTaskBuffer *>(ac_buf_ptr);
-            ac_buf->count = 0;
-            LOG_DEBUG("Core %d: primed AICore head with buf=0x%lx, seq=0", i, ac_buf_ptr);
-        } else {
-            LOG_ERROR("Core %d: AICore free_queue is empty during init!", i);
-            ac_state->head.current_buf_ptr = 0;
-            ac_state->head.current_buf_seq = 0;
-            wmb();
-        }
-    }
-
-    wmb();
-
-    LOG_INFO_V0("Performance profiling initialized for %d cores (with AICore rotation)", worker_count);
-}
-
-/**
- * Internal records-buffer rotation. Called from `l2_swimlane_aicpu_complete_task`
- * after a record is committed and the buffer hits capacity. Only swaps an
- * AICPU-private records pointer — AICore reads from a stable ring and is
- * unaffected by this call.
- */
-static void switch_records_buffer(int core_id, int thread_idx) {
-    L2SwimlaneAicpuTaskPool *state = s_aicpu_task_pools[core_id];
-    if (state == nullptr) {
-        return;
-    }
-
-    L2SwimlaneAicpuTaskBuffer *full_buf = s_current_aicpu_task_buffers[core_id];
-    if (full_buf == nullptr) {
-        return;
-    }
-
-    LOG_INFO_V0("Thread %d: Core %d buffer is full (count=%u)", thread_idx, core_id, full_buf->count);
-
-    uint32_t seq = state->head.current_buf_seq;
-    uint64_t full_buf_ptr = state->head.current_buf_ptr;
-    int rc = enqueue_ready_buffer(
-        s_l2_swimlane_header, thread_idx, core_id, full_buf_ptr, seq, L2SwimlaneBufferKind::AicpuTask
-    );
-    if (rc != 0) {
-        LOG_ERROR("Thread %d: Core %d failed to enqueue buffer (queue full), data lost!", thread_idx, core_id);
-        state->head.dropped_record_count = state->head.dropped_record_count + full_buf->count;
-        full_buf->count = 0;
-        wmb();
-        return;
-    }
-
-    uint32_t next_seq = seq + 1;
-    state->head.current_buf_ptr = 0;
-    state->head.current_buf_seq = next_seq;
-    s_current_aicpu_task_buffers[core_id] = nullptr;
-    wmb();
-
-    L2SwimlaneAicpuTaskBuffer *new_buf = try_pop_records_buffer(core_id, state, next_seq);
-    if (new_buf == nullptr) {
-        return;
-    }
-
-    LOG_INFO_V0(
-        "Thread %d: Core %d switched to new buffer (addr=0x%lx)", thread_idx, core_id,
-        reinterpret_cast<uint64_t>(new_buf)
-    );
-}
-
-// Try to rotate the AICore buffer for `core_id`. Called from the pre-dispatch
-// hook (l2_swimlane_aicpu_on_aicore_dispatch) before the dispatch register
-// write, so every AICore record of the old batch is already in the old buffer
-// before we enqueue it. On success bumps `ac_state->head.current_buf_seq`; on
-// failure (empty free queue or full ready queue) the old buffer stays active,
-// AICore's slot guard drops overflow records, and dropped_record_count is not bumped here.
-static void aicore_rotate(int core_id, int thread_idx) {
-    L2SwimlaneAicoreTaskPool *ac_state = s_aicore_task_pools[core_id];
-    if (ac_state == nullptr) {
-        return;
-    }
-
-    uint64_t old_buf_ptr = ac_state->head.current_buf_ptr;
-    uint32_t seq = ac_state->head.current_buf_seq;
-
-    uint32_t head = 0;
-    uint32_t tail = 0;
-    if (!wait_for_free_queue_entry(&ac_state->free_queue, &head, &tail)) {
-        // No replacement available — AICore continues to write into the old
-        // buffer; its slot counter will hit BUFFER_SIZE and the slot guard
-        // silently drops further records. We deliberately do NOT bump
-        // dropped_record_count here: AICPU has no precise view of how many
-        // tasks will actually fall in this gap before the run ends. The
-        // pre-emptive BUFFER_SIZE bump that used to live here over-counted
-        // when the run ended early — the old buffer's already-written
-        // records still flushed (counted toward `collected`), and the
-        // pre-emptive bump on top of that broke the
-        // `collected + dropped == total` reconcile invariant. The drop is
-        // visible at reconcile time as silent loss
-        // (`total - collected - dropped > 0`) and the WARN below records
-        // the failure mode.
-        LOG_WARN(
-            "Thread %d: Core %d AICore free_queue empty at rotation; AICore slot guard will drop overflow records",
-            thread_idx, core_id
-        );
-        return;
-    }
-
-    uint64_t new_buf_ptr = ac_state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT];
-    rmb();
-    if (new_buf_ptr == 0) {
-        LOG_WARN(
-            "Thread %d: Core %d AICore free_queue returned a null buffer at rotation; keeping old buffer active",
-            thread_idx, core_id
-        );
-        return;
-    }
-
-    // Enqueue the just-filled AICore buffer with count = BUFFER_SIZE.
-    if (old_buf_ptr != 0) {
-        L2SwimlaneAicoreTaskBuffer *old_buf = reinterpret_cast<L2SwimlaneAicoreTaskBuffer *>(old_buf_ptr);
-        old_buf->count = static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE);
-        wmb();
-        int rc = enqueue_ready_buffer(
-            s_l2_swimlane_header, thread_idx, core_id, old_buf_ptr, seq, L2SwimlaneBufferKind::AicoreTask
-        );
-        if (rc != 0) {
-            // Ready queue full — we leave current_buf_ptr pointing at the
-            // old buffer so the run-end flush path retries the enqueue (the
-            // host is draining concurrently; the queue may have space by
-            // then). We deliberately do NOT bump dropped here for the same
-            // reason as the empty-free-queue branch: counting a drop now
-            // would double-count if the flush succeeds in delivering the
-            // buffer to the host. Reconcile reports the actual loss as
-            // silent_loss when neither this rotation nor the flush
-            // delivers the records.
-            LOG_ERROR(
-                "Thread %d: Core %d failed to enqueue AICore buffer at rotation (queue full); will retry at flush",
-                thread_idx, core_id
-            );
-            return;
-        }
-    }
-
-    // Pop next buffer from free_queue and publish via the head channel.
-    // Publish order matters: AICore observes head.current_buf_seq change to
-    // detect rotation, then reads head.current_buf_ptr. Write ptr first so
-    // AICore can never see a new seq with a stale ptr. new_buf->count=0 must
-    // also be visible before AICore's slot writes begin.
-    ac_state->free_queue.head = head + 1;
-    L2SwimlaneAicoreTaskBuffer *new_buf = reinterpret_cast<L2SwimlaneAicoreTaskBuffer *>(new_buf_ptr);
-    new_buf->count = 0;
-
-    wmb();
-    ac_state->head.current_buf_ptr = new_buf_ptr;
-    wmb();
-    ac_state->head.current_buf_seq = seq + 1;
-    wmb();
-}
-
-// Pre-dispatch hook. Called from the dispatch path (scheduler_dispatch in
-// tensormap_and_ringbuffer; aicpu_executor in host_build_graph) immediately
-// before `write_reg(DATA_MAIN_BASE)` for each AICore task. Maintains the
-// per-core dispatch count and rotates the AICore buffer when the count is
-// about to cross a PLATFORM_AICORE_BUFFER_SIZE boundary.
-//
-// Race safety: rotation runs before the dispatch register write. The
-// completion-before-dispatch invariant (AICore per core is single-threaded
-// and AICPU does not dispatch task K+1 until K FIN'd) guarantees AICore has
-// already finished writing — and dcci'd out — every record in the old buffer
-// by then. AICPU can safely enqueue the old buffer to the ready queue.
-//
-// total_record_count accounting also lives here: one AICore record == one
-// dispatch, so the dispatch count IS the AICore-side total. Bumping here
-// (instead of inside complete_task) means level=1 (AICORE_TIMING-only) gets
-// accurate reconcile counts even when complete_task is bypassed.
-void l2_swimlane_aicpu_on_aicore_dispatch(int core_id, int thread_idx) {
-    if (!g_enable_l2_swimlane) {
-        return;
-    }
-    if (core_id < 0 || core_id >= PLATFORM_MAX_CORES) {
-        return;
-    }
-    L2SwimlaneAicoreTaskPool *ac_state = s_aicore_task_pools[core_id];
-    if (ac_state == nullptr) {
-        return;
-    }
-    uint32_t prev = s_aicore_dispatched_count[core_id];
-    // Rotate exactly on the first dispatch of each non-initial BUFFER_SIZE
-    // batch (prev = BUFFER_SIZE, 2*BUFFER_SIZE, ...). PLATFORM_AICORE_BUFFER_SIZE
-    // is asserted power-of-two so the mod lowers to a bitwise AND.
-    if (prev > 0 && (prev & (PLATFORM_AICORE_BUFFER_SIZE - 1)) == 0) {
-        aicore_rotate(core_id, thread_idx);
-    }
-    s_aicore_dispatched_count[core_id] = prev + 1;
-    ac_state->head.total_record_count += 1;
-}
-
-int l2_swimlane_aicpu_complete_task(
-    int core_id, int thread_idx, uint32_t reg_task_id, uint64_t dispatch_time, uint64_t finish_time
-) {
-    if (core_id < 0 || core_id >= PLATFORM_MAX_CORES) {
-        return -1;
-    }
-    L2SwimlaneAicpuTaskPool *state = s_aicpu_task_pools[core_id];
-    if (state == nullptr) {
-        return -1;
-    }
-
-    // Account every commit attempt up front so host can detect silent loss as
-    // `device_total - (collected + dropped)`.
-    state->head.total_record_count += 1;
-
-    L2SwimlaneAicpuTaskBuffer *l2_swimlane_buf = s_current_aicpu_task_buffers[core_id];
-    if (l2_swimlane_buf == nullptr) {
-        l2_swimlane_buf = try_pop_records_buffer(core_id, state, state->head.current_buf_seq);
-        if (l2_swimlane_buf == nullptr) {
-            // No active records buffer (init ran out of free buffers or host has
-            // not refilled after the last published full buffer); count as drop
-            // so host reconciliation stays consistent.
-            state->head.dropped_record_count += 1;
-            return -1;
-        }
-    }
-    uint32_t count = l2_swimlane_buf->count;
-    if (count >= PLATFORM_PROF_BUFFER_SIZE) {
-        // Defensive: should not happen because we rotate at end of every commit.
-        state->head.dropped_record_count += 1;
-        return -1;
-    }
-
-    // AICPU-only timing — three fields, two cache half-lines. Identity
-    // (task_token_raw, core_type) lives in the AICore record; the host
-    // joins by reg_task_id. See L2SwimlaneAicpuTaskRecord header comment.
-    L2SwimlaneAicpuTaskRecord *record = &l2_swimlane_buf->records[count];
-    record->reg_task_id = reg_task_id;
-    record->dispatch_time = dispatch_time;
-    record->finish_time = finish_time;
-
-    uint32_t new_count = count + 1;
-    l2_swimlane_buf->count = new_count;
-    wmb();
-
-    // Rotate AICPU's L2SwimlaneAicpuTaskBuffer after the write so the just-committed
-    // record is preserved.
-    if (new_count >= PLATFORM_PROF_BUFFER_SIZE) {
-        switch_records_buffer(core_id, thread_idx);
-    }
-
-    // AICore-pool stats (total_record_count) are bumped on the dispatch side,
-    // not here. See l2_swimlane_aicpu_on_aicore_dispatch — counting per
-    // dispatch keeps reconcile counts accurate even at level=1 where this
-    // function never runs.
-    return 0;
-}
-
-void l2_swimlane_aicpu_flush(int thread_idx, const int *cur_thread_cores, int core_num) {
-    if (!g_enable_l2_swimlane) {
-        return;
-    }
-
-    void *l2_swimlane_base = reinterpret_cast<void *>(g_platform_l2_swimlane_base);
-    if (l2_swimlane_base == nullptr) {
-        return;
-    }
-
-    rmb();
-
-    LOG_INFO_V0("Thread %d: Flushing performance buffers for %d cores", thread_idx, core_num);
-
-    int flushed_count = 0;
-
-    for (int i = 0; i < core_num; i++) {
-        int core_id = cur_thread_cores[i];
-        L2SwimlaneAicpuTaskPool *state = s_aicpu_task_pools[core_id];
-        if (state == nullptr) continue;
-
-        rmb();
-        uint64_t buf_ptr = state->head.current_buf_ptr;
-        if (buf_ptr == 0) {
-            // No active buffer
-        } else {
-            L2SwimlaneAicpuTaskBuffer *buf = reinterpret_cast<L2SwimlaneAicpuTaskBuffer *>(buf_ptr);
-            if (buf->count > 0) {
-                uint32_t seq = state->head.current_buf_seq;
-                int rc = enqueue_ready_buffer(
-                    s_l2_swimlane_header, thread_idx, core_id, buf_ptr, seq, L2SwimlaneBufferKind::AicpuTask
-                );
-                if (rc == 0) {
-                    LOG_INFO_V0("Thread %d: Core %d flushed buffer with %u records", thread_idx, core_id, buf->count);
-                    flushed_count++;
-                    state->head.current_buf_ptr = 0;
-                    s_current_aicpu_task_buffers[core_id] = nullptr;
-                    wmb();
-                } else {
-                    // ready_queue full at end-of-run: account the loss and clear the
-                    // buffer so host reconcile sees a clean state (current_buf_ptr=0)
-                    // and dropped == flush failures rather than ring/task_id mismatch.
-                    LOG_ERROR(
-                        "Thread %d: Core %d failed to enqueue buffer (queue full), %u records lost!", thread_idx,
-                        core_id, buf->count
-                    );
-                    state->head.dropped_record_count = state->head.dropped_record_count + buf->count;
-                    buf->count = 0;
-                    state->head.current_buf_ptr = 0;
-                    s_current_aicpu_task_buffers[core_id] = nullptr;
-                    wmb();
-                }
-            }
-        }
-
-        // Also flush the current AICore buffer to the ready queue so the host
-        // sees this session's final batch of AICore timestamps.
-        //
-        // High-water mark uses the rotation accounting (total_record_count -
-        // current_buf_seq * BUFFER_SIZE). total_record_count is bumped per
-        // dispatch in l2_swimlane_aicpu_on_aicore_dispatch and is therefore
-        // accurate at all levels — including level=1 where complete_task is
-        // bypassed. The formula clamps to BUFFER_SIZE if an earlier rotation
-        // failed (no free buffer), so we never stamp a partial count when
-        // the buffer is actually full.
-        L2SwimlaneAicoreTaskPool *ac_state = s_aicore_task_pools[core_id];
-        if (ac_state == nullptr) continue;
-
-        rmb();
-        uint64_t ac_buf_ptr = ac_state->head.current_buf_ptr;
-        if (ac_buf_ptr == 0) continue;
-
-        // `total_record_count` is bumped per dispatch (see
-        // l2_swimlane_aicpu_on_aicore_dispatch), not per complete. At
-        // AICPU_TIMING+ the formula below derives the live count for the
-        // current buffer from it. Below that level (AICORE_TIMING, level=1)
-        // the code stamps the buffer's full capacity instead; the host-side
-        // copy_aicore_buffer skips trailing slots whose start_time is still 0,
-        // so over-stating count costs only a scan pass and never yields spurious records.
-        uint32_t ac_mark;
-        if (g_l2_swimlane_level >= L2SwimlaneLevel::AICPU_TIMING) {
-            uint32_t live = ac_state->head.total_record_count -
-                            ac_state->head.current_buf_seq * static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE);
-            if (live == 0) {
-                continue;
-            }
-            ac_mark = (live > static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE)) ?
-                          static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE) :
-                          live;
-        } else {
-            ac_mark = static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE);
-        }
-        L2SwimlaneAicoreTaskBuffer *ac_buf = reinterpret_cast<L2SwimlaneAicoreTaskBuffer *>(ac_buf_ptr);
-        ac_buf->count = ac_mark;
-        wmb();
-
-        uint32_t ac_seq = ac_state->head.current_buf_seq;
-        int rc = enqueue_ready_buffer(
-            s_l2_swimlane_header, thread_idx, core_id, ac_buf_ptr, ac_seq, L2SwimlaneBufferKind::AicoreTask
-        );
-        if (rc == 0) {
-            LOG_INFO_V0(
-                "Thread %d: Core %d flushed AICore buffer (seq=%u, count=%u)", thread_idx, core_id, ac_seq, ac_mark
-            );
-            ac_state->head.current_buf_ptr = 0;
-            wmb();
-        } else {
-            LOG_ERROR("Thread %d: Core %d failed to enqueue AICore buffer at flush (queue full)", thread_idx, core_id);
-            ac_state->head.dropped_record_count = ac_state->head.dropped_record_count + ac_mark;
-            ac_state->head.current_buf_ptr = 0;
-            wmb();
-        }
-    }
-
-    wmb();
-
-    LOG_INFO_V0("Thread %d: Performance buffer flush complete, %d buffers flushed", thread_idx, flushed_count);
-}
-
-// Pop the first buffer from a pool's free_queue and cache it as the current
-// active buffer. Shared init helper for sched and orch phase pool priming.
-// Returns the popped buffer ptr (nullptr if free_queue was empty).
-template <typename Buffer>
-static Buffer *prime_phase_pool(L2SwimlaneAicpuTaskPool *state, int thread_idx, const char *kind_label) {
-    rmb();
-    uint32_t head = state->free_queue.head;
-    uint32_t tail = state->free_queue.tail;
-
-    if (head != tail) {
-        uint64_t buf_ptr = state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT];
-        rmb();
-        state->free_queue.head = head + 1;
-        state->head.current_buf_ptr = buf_ptr;
-        state->head.current_buf_seq = 0;
-        wmb();
-
-        auto *buf = reinterpret_cast<Buffer *>(buf_ptr);
-        buf->count = 0;
-        LOG_DEBUG("Thread %d: popped initial %s phase buffer (addr=0x%lx)", thread_idx, kind_label, buf_ptr);
-        return buf;
-    }
-    LOG_ERROR("Thread %d: %s phase free_queue is empty during init!", thread_idx, kind_label);
-    state->head.current_buf_ptr = 0;
-    return nullptr;
-}
-
-void l2_swimlane_aicpu_init_phase(int worker_count, int num_sched_phase_threads, int num_orch_phase_threads) {
-    void *l2_swimlane_base = reinterpret_cast<void *>(g_platform_l2_swimlane_base);
-    if (l2_swimlane_base == nullptr) {
-        LOG_ERROR("l2_swimlane_data_base is NULL, cannot initialize phase profiling");
-        return;
-    }
-
-    s_l2_swimlane_header = get_l2_swimlane_header(l2_swimlane_base);
-
-    s_l2_swimlane_header->num_sched_phase_threads = static_cast<uint32_t>(num_sched_phase_threads);
-    s_l2_swimlane_header->num_orch_phase_threads = static_cast<uint32_t>(num_orch_phase_threads);
-    s_l2_swimlane_header->num_phase_cores = 0;
-    memset(s_l2_swimlane_header->core_to_thread, -1, sizeof(s_l2_swimlane_header->core_to_thread));
-    s_phase_initialized = true;
-
-    int sched_n = num_sched_phase_threads;
-    if (sched_n > PLATFORM_MAX_AICPU_THREADS) sched_n = PLATFORM_MAX_AICPU_THREADS;
-    int orch_n = num_orch_phase_threads;
-    if (orch_n > PLATFORM_MAX_AICPU_THREADS) orch_n = PLATFORM_MAX_AICPU_THREADS;
-
-    for (int t = 0; t < sched_n; t++) {
-        auto *state = get_sched_phase_buffer_state(l2_swimlane_base, worker_count, t);
-        s_sched_phase_pools[t] = state;
-        s_current_sched_phase_buffers[t] = prime_phase_pool<L2SwimlaneAicpuSchedPhaseBuffer>(state, t, "sched");
-    }
-    for (int t = sched_n; t < PLATFORM_MAX_AICPU_THREADS; t++) {
-        s_sched_phase_pools[t] = nullptr;
-        s_current_sched_phase_buffers[t] = nullptr;
-    }
-
-    for (int t = 0; t < orch_n; t++) {
-        auto *state = get_orch_phase_buffer_state(l2_swimlane_base, worker_count, t);
-        s_orch_phase_pools[t] = state;
-        s_current_orch_phase_buffers[t] = prime_phase_pool<L2SwimlaneAicpuOrchPhaseBuffer>(state, t, "orch");
-    }
-    for (int t = orch_n; t < PLATFORM_MAX_AICPU_THREADS; t++) {
-        s_orch_phase_pools[t] = nullptr;
-        s_current_orch_phase_buffers[t] = nullptr;
-    }
-
-    wmb();
-
-    LOG_INFO_V0(
-        "Phase profiling initialized: %d sched threads, %d orch threads, %d records/thread", num_sched_phase_threads,
-        num_orch_phase_threads, PLATFORM_PHASE_RECORDS_PER_THREAD
-    );
-}
-
-// Generic phase-buffer switch. Enqueue the full buffer to its thread's
-// ready queue under `kind`, then pop a fresh buffer from free_queue. Sets
-// `*current_buf_out` to nullptr if no free buffer is available — subsequent
-// records on that thread will drop until the host catches up.
-// `thread_idx` is the AICPU thread doing the enqueue (always the caller); it
-// selects that thread's own SPSC ready queue, which it must own exclusively.
-// `pool_idx` is the pool ordinal the host uses to file records and recycle the
-// buffer to that pool (the same ordinal indexes the output lane). For sched
-// pools the two coincide (thread t → queue t, pool t); for the single orch
-// instance they differ (orchestrator's thread, but pool ordinal 0).
-template <typename Buffer>
-static void switch_phase_buffer_kind(
-    int thread_idx, uint32_t pool_idx, L2SwimlaneAicpuTaskPool *state, Buffer **current_buf_out,
-    L2SwimlaneBufferKind kind, const char *kind_label
-) {
-    Buffer *full_buf = *current_buf_out;
-    if (state == nullptr || full_buf == nullptr) return;
-
-    LOG_INFO_V0("Thread %d: %s phase buffer is full (count=%u)", thread_idx, kind_label, full_buf->count);
-
-    uint32_t seq = state->head.current_buf_seq;
-    int rc = enqueue_ready_buffer(s_l2_swimlane_header, thread_idx, pool_idx, state->head.current_buf_ptr, seq, kind);
-    if (rc != 0) {
-        LOG_ERROR(
-            "Thread %d: failed to enqueue %s phase buffer (queue full), %u records lost!", thread_idx, kind_label,
-            full_buf->count
-        );
-        state->head.dropped_record_count += full_buf->count;
-        full_buf->count = 0;
-        wmb();
-        return;
-    }
-
-    uint32_t head = 0;
-    uint32_t tail = 0;
-    if (wait_for_free_queue_entry(&state->free_queue, &head, &tail)) {
-        uint64_t new_buf_ptr = state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT];
-        rmb();
-        state->free_queue.head = head + 1;
-        if (new_buf_ptr == 0) {
-            *current_buf_out = nullptr;
-            state->head.current_buf_ptr = 0;
-            wmb();
-            return;
-        }
-        state->head.current_buf_ptr = new_buf_ptr;
-        state->head.current_buf_seq = seq + 1;
-        wmb();
-
-        Buffer *new_buf = reinterpret_cast<Buffer *>(new_buf_ptr);
-        new_buf->count = 0;
-        *current_buf_out = new_buf;
-        LOG_INFO_V0("Thread %d: switched to new %s phase buffer", thread_idx, kind_label);
-    } else {
-        LOG_WARN(
-            "Thread %d: no free %s phase buffer available, dropping records until Host catches up", thread_idx,
-            kind_label
-        );
-        *current_buf_out = nullptr;
-        state->head.current_buf_ptr = 0;
-        wmb();
-    }
-}
-
-// Acquire a writable slot in the per-thread phase buffer. Handles the
-// nullptr-recover path (a prior switch couldn't pop a free buffer) and the
-// buffer-full → switch path. Returns nullptr if the record must be dropped;
-// callers should bump `dropped_record_count` and return when nullptr.
-template <typename Buffer, typename Record>
-static Record *acquire_phase_slot(
-    int thread_idx, uint32_t pool_idx, L2SwimlaneAicpuTaskPool *state, Buffer **current_buf_out,
-    L2SwimlaneBufferKind kind, const char *kind_label
-) {
-    Buffer *buf = *current_buf_out;
-    if (buf == nullptr) {
-        uint32_t head = 0;
-        uint32_t tail = 0;
-        if (wait_for_free_queue_entry(&state->free_queue, &head, &tail)) {
-            uint64_t buf_ptr = state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT];
-            rmb();
-            state->free_queue.head = head + 1;
-            if (buf_ptr == 0) {
-                return nullptr;
-            }
-            state->head.current_buf_ptr = buf_ptr;
-            state->head.current_buf_seq += 1;
-            wmb();
-            buf = reinterpret_cast<Buffer *>(buf_ptr);
-            buf->count = 0;
-            *current_buf_out = buf;
-            LOG_INFO_V0("Thread %d: recovered %s phase buffer", thread_idx, kind_label);
-        }
-        if (buf == nullptr) return nullptr;
-    }
-
-    uint32_t idx = buf->count;
-    if (idx >= PLATFORM_PHASE_RECORDS_PER_THREAD) {
-        switch_phase_buffer_kind(thread_idx, pool_idx, state, current_buf_out, kind, kind_label);
-        buf = *current_buf_out;
-        if (buf == nullptr) return nullptr;
-        idx = buf->count;
-        if (idx >= PLATFORM_PHASE_RECORDS_PER_THREAD) return nullptr;
-    }
-    Record *record = &buf->records[idx];
-    buf->count = idx + 1;
-    return record;
-}
-
-void l2_swimlane_aicpu_record_sched_phase(
-    int thread_idx, L2SwimlaneSchedPhaseKind kind, uint64_t start_time, uint64_t end_time, uint32_t loop_iter,
-    uint32_t tasks_processed, uint32_t pop_hit, uint32_t pop_miss, const int16_t *shared_at_start,
-    const int16_t *shared_at_end
-) {
-    if (!s_phase_initialized) return;
-    auto *state = s_sched_phase_pools[thread_idx];
-    if (state == nullptr) return;
-
-    state->head.total_record_count += 1;
-
-    auto *record = acquire_phase_slot<L2SwimlaneAicpuSchedPhaseBuffer, L2SwimlaneAicpuSchedPhaseRecord>(
-        /*thread_idx=*/thread_idx, /*pool_idx=*/static_cast<uint32_t>(thread_idx), state,
-        &s_current_sched_phase_buffers[thread_idx], L2SwimlaneBufferKind::AicpuSchedPhase, "sched"
-    );
-    if (record == nullptr) {
-        state->head.dropped_record_count += 1;
-        return;
-    }
-    record->start_time = start_time;
-    record->end_time = end_time;
-    record->loop_iter = loop_iter;
-    record->kind = kind;
-    record->tasks_processed = tasks_processed;
-    record->pop_hit = pop_hit;
-    record->pop_miss = pop_miss;
-    auto copy_snapshot = [](int16_t dst[L2SWIMLANE_NUM_QUEUE_SHAPES], const int16_t *src) {
-        if (src == nullptr) {
-            for (int i = 0; i < L2SWIMLANE_NUM_QUEUE_SHAPES; i++)
-                dst[i] = 0;
-        } else {
-            for (int i = 0; i < L2SWIMLANE_NUM_QUEUE_SHAPES; i++)
-                dst[i] = src[i];
-        }
-    };
-    copy_snapshot(record->shared_depth_at_start, shared_at_start);
-    copy_snapshot(record->shared_depth_at_end, shared_at_end);
-}
-
-void l2_swimlane_aicpu_set_orch_thread_idx(int thread_idx) { s_orch_thread_idx = thread_idx; }
-
-void l2_swimlane_aicpu_record_orch_phase(
-    uint64_t start_time, uint64_t end_time, uint64_t task_id, uint32_t submit_idx
-) {
-    if (s_orch_thread_idx < 0 || !s_phase_initialized) return;
-    // Single orch instance (dep_gen / scope_stats style): all orch records
-    // funnel into pool ordinal 0, regardless of which AICPU thread the
-    // orchestrator runs on. s_orch_thread_idx is the orchestrator's AICPU
-    // thread index — used only to pick its own ready queue (SPSC owner); the
-    // entry is tagged with pool ordinal 0 so the host files it into orch lane 0.
-    auto *state = s_orch_phase_pools[0];
-    if (state == nullptr) return;
-
-    state->head.total_record_count += 1;
-
-    auto *record = acquire_phase_slot<L2SwimlaneAicpuOrchPhaseBuffer, L2SwimlaneAicpuOrchPhaseRecord>(
-        /*thread_idx=*/s_orch_thread_idx, /*pool_idx=*/0, state, &s_current_orch_phase_buffers[0],
-        L2SwimlaneBufferKind::AicpuOrchPhase, "orch"
-    );
-    if (record == nullptr) {
-        state->head.dropped_record_count += 1;
-        return;
-    }
-    record->start_time = start_time;
-    record->end_time = end_time;
-    record->task_id = task_id;
-    record->submit_idx = submit_idx;
-}
-
-// Final-drain flush of one phase pool's active buffer. `thread_idx` / `pool_idx`
-// as in switch_phase_buffer_kind.
-static void flush_phase_pool(
-    int thread_idx, uint32_t pool_idx, L2SwimlaneAicpuTaskPool *state, L2SwimlaneBufferKind kind, const char *kind_label
-) {
-    if (state == nullptr) return;
-    rmb();
-    uint64_t buf_ptr = state->head.current_buf_ptr;
-    if (buf_ptr == 0) return;
-    // `count` sits AFTER the records[] array in TypedBuffer, so its byte offset
-    // is N * sizeof(Record) — different for sched (40B) vs orch (32B) records.
-    // Read/write it through the matching buffer type; a single fixed cast reads
-    // past the orch buffer, sees 0, and silently skips the orch flush.
-    volatile uint32_t *count_ptr = (kind == L2SwimlaneBufferKind::AicpuOrchPhase) ?
-                                       &reinterpret_cast<L2SwimlaneAicpuOrchPhaseBuffer *>(buf_ptr)->count :
-                                       &reinterpret_cast<L2SwimlaneAicpuSchedPhaseBuffer *>(buf_ptr)->count;
-    if (*count_ptr == 0) return;
-    uint32_t seq = state->head.current_buf_seq;
-    int rc = enqueue_ready_buffer(s_l2_swimlane_header, thread_idx, pool_idx, buf_ptr, seq, kind);
-    if (rc == 0) {
-        LOG_INFO_V0("Thread %d: flushed %s phase buffer with %u records", thread_idx, kind_label, *count_ptr);
-    } else {
-        LOG_ERROR(
-            "Thread %d: failed to enqueue %s phase buffer (queue full), %u records lost!", thread_idx, kind_label,
-            *count_ptr
-        );
-        state->head.dropped_record_count += *count_ptr;
-        *count_ptr = 0;
-    }
-    state->head.current_buf_ptr = 0;
-    wmb();
-}
-
-// Final-drain flush of the scheduler-phase pool owned by this scheduler thread.
-void l2_swimlane_aicpu_flush_sched_phase_buffer(int thread_idx) {
-    if (!s_phase_initialized || s_l2_swimlane_header == nullptr) return;
-    flush_phase_pool(
-        thread_idx, static_cast<uint32_t>(thread_idx), s_sched_phase_pools[thread_idx],
-        L2SwimlaneBufferKind::AicpuSchedPhase, "sched"
-    );
-    s_current_sched_phase_buffers[thread_idx] = nullptr;
-}
-
-// Final-drain flush of the single orchestrator's orch-phase pool (ordinal 0).
-// Called once by the orchestrator thread at orchestration end; see
-// record_orch_phase for the pool-0 / own-ready-queue tagging.
-void l2_swimlane_aicpu_flush_orch_phase_buffer(int thread_idx) {
-    if (!s_phase_initialized || s_l2_swimlane_header == nullptr) return;
-    flush_phase_pool(thread_idx, /*pool_idx=*/0, s_orch_phase_pools[0], L2SwimlaneBufferKind::AicpuOrchPhase, "orch");
-    s_current_orch_phase_buffers[0] = nullptr;
-}
-
-void l2_swimlane_aicpu_init_core_assignments(int total_cores) {
-    if (!s_phase_initialized) {
-        return;
-    }
-    memset(s_l2_swimlane_header->core_to_thread, -1, sizeof(s_l2_swimlane_header->core_to_thread));
-    s_l2_swimlane_header->num_phase_cores = static_cast<uint32_t>(total_cores);
-    wmb();
-    LOG_INFO_V0("Core-to-thread mapping init: %d cores", total_cores);
-}
-
-void l2_swimlane_aicpu_write_core_assignments_for_thread(int thread_idx, const int *core_ids, int core_num) {
-    if (!s_phase_initialized) {
-        return;
-    }
-    for (int i = 0; i < core_num; i++) {
-        int core_id = core_ids[i];
-        if (core_id >= 0 && core_id < PLATFORM_MAX_CORES) {
-            s_l2_swimlane_header->core_to_thread[core_id] = static_cast<int8_t>(thread_idx);
-        }
-    }
-    wmb();
-}
diff --git a/src/a5/platform/sim/host/CMakeLists.txt b/src/a5/platform/sim/host/CMakeLists.txt
index 396ef1edf..6a5ca9e3a 100644
--- a/src/a5/platform/sim/host/CMakeLists.txt
+++ b/src/a5/platform/sim/host/CMakeLists.txt
@@ -45,9 +45,9 @@ list(APPEND HOST_RUNTIME_SOURCES
     "${CMAKE_CURRENT_SOURCE_DIR}/profiling_copy.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/pto_runtime_c_api.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/platform_compile_info.cpp"
-    "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/l2_swimlane_collector.cpp"
+    "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/l2_swimlane_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/pmu_collector.cpp"
-    "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/dep_gen_collector.cpp"
+    "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/dep_gen_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/scope_stats_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/tensor_dump_collector.cpp"
     "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/sim/aicpu/platform_aicpu_affinity.cpp"
diff --git a/src/a5/platform/include/host/dep_gen_collector.h b/src/common/platform/include/host/dep_gen_collector.h
similarity index 94%
rename from src/a5/platform/include/host/dep_gen_collector.h
rename to src/common/platform/include/host/dep_gen_collector.h
index 6b8f8cfb8..dddab9cd4 100644
--- a/src/a5/platform/include/host/dep_gen_collector.h
+++ b/src/common/platform/include/host/dep_gen_collector.h
@@ -11,16 +11,16 @@
 
 /**
  * @file dep_gen_collector.h
- * @brief Host-side dep_gen (SubmitTrace) buffer allocation, streaming
- *        collection, and raw binary export.
+ * @brief Host-side dep_gen (SubmitTrace) buffer allocation and streaming
+ *        collection for in-memory replay.
  *
  * Architecture:
  * - BufferPoolManager<DepGenModule>: shared mgmt-thread infrastructure that
  *   polls per-thread ready queues, drains done-queue shards, and replenishes
  *   the single instance's free_queue from a unified recycled pool.
  * - DepGenCollector: collector thread shards pop full DepGenBuffers from the
- *   manager and append their DepGenRecords to a binary file
- *   (submit_trace.bin).
+ *   manager and append their DepGenRecords to an in-memory vector consumed by
+ *   host replay after device execution completes.
  *
  * Lifecycle:
  *   init()                       — Allocate header + 1 BufferState + N DepGenBuffers
@@ -36,14 +36,13 @@
  *                                  (incomplete graph; user gets a warning).
  *   finalize()                   — Free all device memory, unregister.
  *
- * Output format (submit_trace.bin): a fixed-size header followed by a
- * contiguous stream of DepGenRecord values. Replay (future PR) reads this
- * back. Layout intentionally trivial (no varint / framing) so the
- * `sizeof(DepGenRecord)` ABI in `common/dep_gen.h` is the only contract.
+ * Output contract: a contiguous in-memory stream of DepGenRecord values.
+ * Host replay consumes this stream directly; no submit_trace.bin intermediary
+ * is written by the collector.
  */
 
-#ifndef SRC_A5_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_
-#define SRC_A5_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_
+#ifndef SRC_COMMON_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_
+#define SRC_COMMON_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_
 
 #include <atomic>
 #include <cstddef>
@@ -180,7 +179,7 @@ class DepGenCollector : public profiling_common::ProfilerBase<DepGenCollector, D
      * @param num_threads     Number of AICPU scheduling threads (so the
      *                        DataHeader sizes its per-thread ready queues)
      * @param alloc_cb        Memory allocation callback
-     * @param register_cb     halHostRegister callback (nullptr on a5)
+     * @param register_cb     halHostRegister callback (nullptr on non-SVM platforms)
      * @param free_cb         Memory free callback
      * @param device_id       Device ID
      * @return 0 on success, non-zero on failure
@@ -279,4 +278,4 @@ inline std::string make_deps_json_path(const std::string &output_dir) {
     return (dir / "deps.json").string();
 }
 
-#endif  // SRC_A5_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_
+#endif  // SRC_COMMON_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_
diff --git a/src/a5/platform/include/host/l2_swimlane_collector.h b/src/common/platform/include/host/l2_swimlane_collector.h
similarity index 98%
rename from src/a5/platform/include/host/l2_swimlane_collector.h
rename to src/common/platform/include/host/l2_swimlane_collector.h
index 44d755611..36d1e7d07 100644
--- a/src/a5/platform/include/host/l2_swimlane_collector.h
+++ b/src/common/platform/include/host/l2_swimlane_collector.h
@@ -23,8 +23,8 @@
  * Memory operations are injected through callbacks for sim/onboard portability.
  */
 
-#ifndef SRC_A5_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_
-#define SRC_A5_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_
+#ifndef SRC_COMMON_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_
+#define SRC_COMMON_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_
 
 #include <atomic>
 #include <array>
@@ -325,7 +325,7 @@ class L2SwimlaneCollector : public profiling_common::ProfilerBase<L2SwimlaneColl
      *                                 JSON `version`.
      * @param alloc_cb                 Device memory allocation callback
      * @param register_cb              Memory registration callback (nullptr for
-     *                                 simulation)
+     *                                 simulation and non-SVM platforms)
      * @param free_cb                  Device memory free callback
      * @param user_data                Opaque pointer forwarded to callbacks
      * @param output_prefix            Per-task directory; l2_swimlane_records.json
@@ -493,4 +493,4 @@ class L2SwimlaneCollector : public profiling_common::ProfilerBase<L2SwimlaneColl
     void copy_aicore_buffer(const ReadyBufferInfo &info);
 };
 
-#endif  // SRC_A5_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_
+#endif  // SRC_COMMON_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_
diff --git a/src/a2a3/platform/shared/aicpu/dep_gen_collector_aicpu.cpp b/src/common/platform/shared/aicpu/dep_gen_collector_aicpu.cpp
similarity index 100%
rename from src/a2a3/platform/shared/aicpu/dep_gen_collector_aicpu.cpp
rename to src/common/platform/shared/aicpu/dep_gen_collector_aicpu.cpp
diff --git a/src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp b/src/common/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
similarity index 100%
rename from src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
rename to src/common/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
diff --git a/src/a5/platform/shared/host/dep_gen_collector.cpp b/src/common/platform/shared/host/dep_gen_collector.cpp
similarity index 97%
rename from src/a5/platform/shared/host/dep_gen_collector.cpp
rename to src/common/platform/shared/host/dep_gen_collector.cpp
index abafa24cc..7d60b8f84 100644
--- a/src/a5/platform/shared/host/dep_gen_collector.cpp
+++ b/src/common/platform/shared/host/dep_gen_collector.cpp
@@ -19,10 +19,10 @@
  *        consumed directly by the host replay — no on-disk submit_trace.bin
  *        intermediary.
  *
- * a5 specifics: device↔host transfers go through profiling_copy.h. Each
- * DepGenBuffer's contents are pulled from device on demand inside
- * ProfilerAlgorithms::process_entry, so on_buffer_collected can read
- * `count` and `records[]` directly off the host shadow.
+ * Non-SVM platforms route device↔host transfers through profiling_copy.h.
+ * Each DepGenBuffer's contents are pulled from device on demand inside
+ * ProfilerAlgorithms::process_entry, so on_buffer_collected can read `count`
+ * and `records[]` directly off the host shadow.
  */
 
 #include "host/dep_gen_collector.h"
@@ -61,7 +61,7 @@ int DepGenCollector::init(
     // consistent values during init. shm_host_ stays nullptr until the shm
     // allocation succeeds — start(tf) gates on shm_host_.
     set_memory_context(
-        alloc_cb, register_cb, free_cb, profiling_copy_to_device_for_ops, profiling_copy_from_device_for_ops,
+        alloc_cb, register_cb, free_cb, profiling_copy_to_device_or_null(), profiling_copy_from_device_or_null(),
         /*shm_dev=*/nullptr, /*shm_host=*/nullptr, /*shm_size=*/0, device_id
     );
 
@@ -131,7 +131,7 @@ int DepGenCollector::init(
     shm_dev_ = shm_dev_local;
     shm_size_ = shm_size;
     set_memory_context(
-        alloc_cb, register_cb, free_cb, profiling_copy_to_device_for_ops, profiling_copy_from_device_for_ops,
+        alloc_cb, register_cb, free_cb, profiling_copy_to_device_or_null(), profiling_copy_from_device_or_null(),
         shm_dev_local, shm_host_local, shm_size, device_id
     );
     initialized_ = true;
diff --git a/src/a5/platform/shared/host/l2_swimlane_collector.cpp b/src/common/platform/shared/host/l2_swimlane_collector.cpp
similarity index 93%
rename from src/a5/platform/shared/host/l2_swimlane_collector.cpp
rename to src/common/platform/shared/host/l2_swimlane_collector.cpp
index eb258dc2c..3f2098ea4 100644
--- a/src/a5/platform/shared/host/l2_swimlane_collector.cpp
+++ b/src/common/platform/shared/host/l2_swimlane_collector.cpp
@@ -31,6 +31,7 @@
 
 #include "common/memory_barrier.h"
 #include "common/unified_log.h"
+#include "host/profiling_copy.h"
 
 // =============================================================================
 // L2SwimlaneCollector Implementation
@@ -57,8 +58,8 @@ int L2SwimlaneCollector::initialize(
         return -1;
     }
 
-    // register_cb may legitimately be null (a5 has no halHostRegister); alloc
-    // and free callbacks are mandatory. Matches dep_gen / pmu / scope_stats.
+    // register_cb may legitimately be null on simulation / non-SVM platforms;
+    // alloc and free callbacks are mandatory. Matches dep_gen / pmu / scope_stats.
     if (alloc_cb == nullptr || free_cb == nullptr) {
         LOG_ERROR("L2SwimlaneCollector::initialize: alloc_cb/free_cb must be non-null");
         return -1;
@@ -84,15 +85,15 @@ int L2SwimlaneCollector::initialize(
     // shm allocation succeeds — the nullptr guard makes a post-failure
     // start(tf) a no-op.
     set_memory_context(
-        alloc_cb, register_cb, free_cb, profiling_copy_to_device_for_ops, profiling_copy_from_device_for_ops,
+        alloc_cb, register_cb, free_cb, profiling_copy_to_device_or_null(), profiling_copy_from_device_or_null(),
         /*shm_dev=*/nullptr, /*shm_host=*/nullptr, /*shm_size=*/0, device_id
     );
 
     // RAII rollback: shm_host_ is only set at the end of init, so finalize()
     // (which early-returns on shm_host_ == nullptr) cannot clean up a partial
     // allocation. Any early return after this point therefore releases every
-    // manager-tracked device buffer + a5 host shadow allocated so far via the
-    // guard's destructor; guard.commit() disarms it on the success path.
+    // manager-tracked device buffer + non-SVM host shadow allocated so far via
+    // the guard's destructor; guard.commit() disarms it on the success path.
     // Matches dep_gen / pmu.
     profiling_common::InitRollbackGuard<decltype(manager_)> guard(manager_, free_cb);
 
@@ -112,14 +113,14 @@ int L2SwimlaneCollector::initialize(
     LOG_DEBUG("  Total shared memory:  %zu bytes (%zu KB)", total_size, total_size / 1024);
 
     // Step 2: Allocate the shared-memory region (header + SPSC slot arrays)
-    // via the base allocator. On a5 there is no halHostRegister and the
-    // device HBM region is not host-addressable, so alloc_paired_buffer
-    // mallocs a host shadow and seeds the device copy (the shadow path is
-    // selected by the copy_to_device callback installed in set_memory_context
-    // above). The host initializes the region through perf_host_ptr below,
-    // and a single profiling_copy_to_device at the end of init pushes the
-    // primed state to the device. Writing perf_host_ptr directly to the raw
-    // device pointer here would SIGSEGV — see set_memory_context above.
+    // via the base allocator. Non-SVM platforms do not expose device HBM as
+    // host-addressable memory, so alloc_paired_buffer mallocs a host shadow and
+    // seeds the device copy (the shadow path is selected by the copy_to_device
+    // callback installed in set_memory_context above). The host initializes the
+    // region through perf_host_ptr below, and a single profiling_copy_to_device
+    // at the end of init pushes the primed state to the device. Initializing
+    // the region through the raw device pointer instead would SIGSEGV; see
+    // set_memory_context above.
     void *perf_host_ptr = nullptr;
     void *perf_dev_ptr = alloc_paired_buffer(total_size, &perf_host_ptr);
     if (perf_dev_ptr == nullptr) {
@@ -170,7 +171,9 @@ int L2SwimlaneCollector::initialize(
     LOG_DEBUG("  buffer_capacity:        %d", PLATFORM_PROF_BUFFER_SIZE);
     LOG_DEBUG("  queue capacity:         %d", PLATFORM_PROF_READYQUEUE_SIZE);
 
-    // Step 5: Initialize L2SwimlaneAicpuTaskPools — 1 buffer per core in free_queue, rest to recycled pool
+    // Step 5: Initialize L2SwimlaneAicpuTaskPools. Seed each core's free_queue
+    // with up to PLATFORM_PROF_SLOT_COUNT buffers; any remaining buffers go to
+    // the host recycled pool.
     for (int i = 0; i < num_aicore; i++) {
         L2SwimlaneAicpuTaskPool *state = get_perf_buffer_state(perf_host_ptr, i);
         memset(state, 0, sizeof(L2SwimlaneAicpuTaskPool));
@@ -180,6 +183,9 @@ int L2SwimlaneCollector::initialize(
         state->head.current_buf_ptr = 0;
         state->head.current_buf_seq = 0;
 
+        const int initial_free_count = (PLATFORM_PROF_BUFFERS_PER_CORE < PLATFORM_PROF_SLOT_COUNT) ?
+                                           PLATFORM_PROF_BUFFERS_PER_CORE :
+                                           PLATFORM_PROF_SLOT_COUNT;
         for (int s = 0; s < PLATFORM_PROF_BUFFERS_PER_CORE; s++) {
             void *host_buf_ptr = nullptr;
             void *dev_buf_ptr = alloc_paired_buffer(sizeof(L2SwimlaneAicpuTaskBuffer), &host_buf_ptr);
@@ -191,14 +197,14 @@ int L2SwimlaneCollector::initialize(
             memset(buf, 0, sizeof(L2SwimlaneAicpuTaskBuffer));
             buf->count = 0;
 
-            if (s == 0) {
-                state->free_queue.buffer_ptrs[0] = reinterpret_cast<uint64_t>(dev_buf_ptr);
+            if (s < initial_free_count) {
+                state->free_queue.buffer_ptrs[s] = reinterpret_cast<uint64_t>(dev_buf_ptr);
             } else {
                 manager_.push_recycled(static_cast<int>(ProfBufferType::AICPU_TASK), dev_buf_ptr);
             }
         }
         wmb();
-        state->free_queue.tail = 1;
+        state->free_queue.tail = static_cast<uint32_t>(initial_free_count);
         wmb();
     }
 
@@ -208,6 +214,9 @@ int L2SwimlaneCollector::initialize(
         L2SwimlaneAicoreTaskPool *ac_state = get_aicore_buffer_state(perf_host_ptr, num_aicore, i);
         memset(ac_state, 0, sizeof(L2SwimlaneAicoreTaskPool));
 
+        const int initial_free_count = (PLATFORM_AICORE_BUFFERS_PER_CORE < PLATFORM_PROF_SLOT_COUNT) ?
+                                           PLATFORM_AICORE_BUFFERS_PER_CORE :
+                                           PLATFORM_PROF_SLOT_COUNT;
         for (int s = 0; s < PLATFORM_AICORE_BUFFERS_PER_CORE; s++) {
             void *host_buf_ptr = nullptr;
             void *dev_buf_ptr = alloc_paired_buffer(sizeof(L2SwimlaneAicoreTaskBuffer), &host_buf_ptr);
@@ -219,20 +228,19 @@ int L2SwimlaneCollector::initialize(
             memset(buf, 0, sizeof(L2SwimlaneAicoreTaskBuffer));
             buf->count = 0;
 
-            if (s == 0) {
-                ac_state->free_queue.buffer_ptrs[0] = reinterpret_cast<uint64_t>(dev_buf_ptr);
+            if (s < initial_free_count) {
+                ac_state->free_queue.buffer_ptrs[s] = reinterpret_cast<uint64_t>(dev_buf_ptr);
             } else {
                 manager_.push_recycled(static_cast<int>(ProfBufferType::AICORE_TASK), dev_buf_ptr);
             }
         }
         wmb();
-        ac_state->free_queue.tail = 1;
+        ac_state->free_queue.tail = static_cast<uint32_t>(initial_free_count);
         wmb();
     }
     LOG_DEBUG(
-        "Initialized buffer pools: %d L2SwimlaneAicpuTaskBuffers/core + %d L2SwimlaneAicoreTaskBuffers/core (1 in "
-        "free_queue, "
-        "rest in recycled pool)",
+        "Initialized buffer pools: %d L2SwimlaneAicpuTaskBuffers/core + %d L2SwimlaneAicoreTaskBuffers/core "
+        "(seeded up to PLATFORM_PROF_SLOT_COUNT free_queue slots, rest in recycled pool)",
         PLATFORM_PROF_BUFFERS_PER_CORE, PLATFORM_AICORE_BUFFERS_PER_CORE
     );
 
@@ -260,7 +268,8 @@ int L2SwimlaneCollector::initialize(
 
     // Step 6: Initialize per-thread phase pools — both sched and orch. Each
     // pool is sized to its own PLATFORM_PROF_{SCHED,ORCH}_BUFFERS_PER_THREAD
-    // (1 in free_queue, rest in the recycled pool tagged by kind). Templated on the
+    // (up to PLATFORM_PROF_SLOT_COUNT in free_queue, rest in the recycled pool
+    // tagged by kind). Templated on the
     // concrete TypedBuffer so the `count` zero-store uses the matching layout
     // — sched and orch buffers have DIFFERENT sizes (64B vs 32B records),
     // so a single cast type for both would land the count store past the end
@@ -279,6 +288,8 @@ int L2SwimlaneCollector::initialize(
             auto *state = get_state(perf_host_ptr, num_aicore, t);
             memset(state, 0, sizeof(L2SwimlaneAicpuTaskPool));
             if (t >= buffer_count) continue;  // zeroed state only; no buffers (unused slot)
+            const int initial_free_count =
+                (buffers_per_thread < PLATFORM_PROF_SLOT_COUNT) ? buffers_per_thread : PLATFORM_PROF_SLOT_COUNT;
             for (int s = 0; s < buffers_per_thread; s++) {
                 void *host_buf_ptr = nullptr;
                 void *dev_buf_ptr = alloc_paired_buffer(buffer_bytes, &host_buf_ptr);
@@ -290,14 +301,14 @@ int L2SwimlaneCollector::initialize(
                 // matching Buffer type. The records payload is overwritten by
                 // AICPU on first use.
                 reinterpret_cast<Buffer *>(host_buf_ptr)->count = 0;
-                if (s == 0) {
-                    state->free_queue.buffer_ptrs[0] = reinterpret_cast<uint64_t>(dev_buf_ptr);
+                if (s < initial_free_count) {
+                    state->free_queue.buffer_ptrs[s] = reinterpret_cast<uint64_t>(dev_buf_ptr);
                 } else {
                     manager_.push_recycled(static_cast<int>(recycle_kind), dev_buf_ptr);
                 }
             }
             wmb();
-            state->free_queue.tail = 1;
+            state->free_queue.tail = static_cast<uint32_t>(initial_free_count);
             wmb();
         }
         return 0;
@@ -340,7 +351,7 @@ int L2SwimlaneCollector::initialize(
     wmb();
 
     // Push the host-initialized region (header + every pool's primed
-    // free_queue tail/buffer_ptrs[0]) down to the device. perf_host_ptr is a
+    // free_queue tail/buffer_ptrs[]) down to the device. perf_host_ptr is a
     // malloc'd shadow distinct from the device HBM region, so without this the
     // device never sees the primed free queues and AICPU/AICore read zeros.
     // The mgmt-loop mirror is read-only (device→host) and never re-pushes this
@@ -372,7 +383,7 @@ int L2SwimlaneCollector::initialize(
     perf_shared_mem_dev_ = perf_dev_ptr;
     aicore_ring_addr_table_dev_ = rotation_table_dev;
     set_memory_context(
-        alloc_cb, register_cb, free_cb, profiling_copy_to_device_for_ops, profiling_copy_from_device_for_ops,
+        alloc_cb, register_cb, free_cb, profiling_copy_to_device_or_null(), profiling_copy_from_device_or_null(),
         perf_dev_ptr, perf_host_ptr, total_size, device_id
     );
     return 0;
@@ -959,9 +970,9 @@ int L2SwimlaneCollector::finalize(L2SwimlaneUnregisterCallback unregister_cb, co
     // Every release site below goes through release_one_buffer so an
     // optional halHostRegister unregister and the free stay an inseparable
     // pair — each dev_ptr a register_cb mapped is unregistered before its
-    // device memory is freed. On a5 register_cb is null (no halHostRegister)
-    // so the unregister branch is a no-op and only the device free runs; the
-    // paired host shadows are reclaimed separately by clear_mappings() below.
+    // device memory is freed. On non-SVM platforms register_cb is null, so the
+    // unregister branch is a no-op and only the device free runs; the paired
+    // host shadows are reclaimed separately by clear_mappings() below.
     // The pairing matters on a2a3, where leaking HAL registrations across
     // init_l2_swimlane() invocations makes back-to-back tests on a reused
     // Worker fail at rc=8 from halHostRegister.
@@ -1043,7 +1054,7 @@ int L2SwimlaneCollector::finalize(L2SwimlaneUnregisterCallback unregister_cb, co
     // Free any malloc'd host shadows still tracked in the manager's
     // malloc_shadows_ — the shm region, rotation table, and per-pool buffers
     // were freed above via release_one_buffer (device pointer only), so their
-    // paired shadows (allocated by alloc_paired_buffer on a5's no-SVM path)
+    // paired shadows (allocated by alloc_paired_buffer on the non-SVM path)
     // never went through release_owned_buffers. clear_mappings() std::free's
     // them. No-op on SVM (host_ptr == dev_ptr, nothing in malloc_shadows_).
     // Matches PMU / DepGen finalize.
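
Reviewer note on the free-queue priming change in `L2SwimlaneCollector::initialize` above: the old code put exactly one buffer in `free_queue` and recycled the rest, while the new code places up to `PLATFORM_PROF_SLOT_COUNT` buffers there and sets `tail` to that count. The following is a minimal standalone sketch of that split, not the collector's actual code. `FreeQueue`, `kSlotCount`, and `prime_free_queue` are hypothetical stand-ins, the slot count of 4 is an assumed value, and the `wmb()` barriers around the `tail` store are omitted.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for PLATFORM_PROF_SLOT_COUNT; the real value is
// platform-defined and 4 is only an assumption for this sketch.
constexpr int kSlotCount = 4;

// Simplified free queue holding only the fields the priming loop touches.
struct FreeQueue {
    uint64_t buffer_ptrs[kSlotCount] = {};
    uint32_t tail = 0;
};

// Places the first min(buffers.size(), kSlotCount) buffers in the free queue
// and returns the remainder, which the real code pushes to the recycled pool.
std::vector<uint64_t> prime_free_queue(FreeQueue &q, const std::vector<uint64_t> &buffers) {
    const int total = static_cast<int>(buffers.size());
    const int initial_free_count = std::min(total, kSlotCount);
    std::vector<uint64_t> recycled;
    for (int s = 0; s < total; s++) {
        if (s < initial_free_count) {
            q.buffer_ptrs[s] = buffers[s];  // slot index follows s, never past the array end
        } else {
            recycled.push_back(buffers[s]);
        }
    }
    // The real code publishes tail between write barriers; omitted here.
    q.tail = static_cast<uint32_t>(initial_free_count);
    return recycled;
}
```

The clamp matters in both directions: with fewer buffers than slots, `tail` must not claim slots that were never filled, and with more buffers than slots, the index into `buffer_ptrs` must stay in bounds.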
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py b/tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
index 7377b545c..ef1233128 100644
--- a/tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
@@ -53,7 +53,7 @@ def _task_id(ring: int, local: int) -> int:
 
 @scene_test(level=2, runtime="tensormap_and_ringbuffer")
 class TestDepGen(SceneTestCase):
-    """Vector example, run with dep_gen enabled, then verify submit_trace.bin."""
+    """Vector example, run with dep_gen enabled, then verify generated deps."""
 
     CALLABLE = {
         "orchestration": {
diff --git a/tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py b/tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
index 3b621e749..04777d9a0 100644
--- a/tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
+++ b/tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
@@ -53,7 +53,7 @@ def _task_id(ring: int, local: int) -> int:
 
 @scene_test(level=2, runtime="tensormap_and_ringbuffer")
 class TestDepGen(SceneTestCase):
-    """Vector example, run with dep_gen enabled, then verify submit_trace.bin."""
+    """Vector example, run with dep_gen enabled, then verify generated deps."""
 
     CALLABLE = {
         "orchestration": {
diff --git a/tests/ut/cpp/CMakeLists.txt b/tests/ut/cpp/CMakeLists.txt
index 0c868a45f..1a565382c 100644
--- a/tests/ut/cpp/CMakeLists.txt
+++ b/tests/ut/cpp/CMakeLists.txt
@@ -470,7 +470,7 @@ add_executable(test_l3_l2_orch_comm_sim_runner
     ${CMAKE_SOURCE_DIR}/../../../src/common/platform/sim/sim_context/cpu_sim_context.cpp
     ${CMAKE_SOURCE_DIR}/../../../src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/runtime.cpp
     ${CMAKE_SOURCE_DIR}/../../../src/a2a3/platform/sim/host/profiling_copy.cpp
-    ${CMAKE_SOURCE_DIR}/../../../src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
+    ${CMAKE_SOURCE_DIR}/../../../src/common/platform/shared/host/l2_swimlane_collector.cpp
     ${CMAKE_SOURCE_DIR}/../../../src/a2a3/platform/shared/host/pmu_collector.cpp
     ${CMAKE_SOURCE_DIR}/../../../src/common/platform/shared/host/tensor_dump_collector.cpp
     ${CMAKE_SOURCE_DIR}/../../../src/common/platform/shared/host/scope_stats_collector.cpp
@@ -719,7 +719,7 @@ if(SIMPLER_ENABLE_HARDWARE_TESTS)
         ${PROJECT_ROOT}/src/common/platform/shared/host/tensor_dump_collector.cpp
         ${PROJECT_ROOT}/src/a5/platform/onboard/host/profiling_copy.cpp
         ${PROJECT_ROOT}/src/a5/platform/onboard/host/memory_allocator.cpp
-        ${PROJECT_ROOT}/src/a5/platform/shared/host/l2_swimlane_collector.cpp
+        ${PROJECT_ROOT}/src/common/platform/shared/host/l2_swimlane_collector.cpp
         ${PROJECT_ROOT}/src/a5/platform/shared/host/pmu_collector.cpp
     )
     target_compile_options(test_l3_l2_orch_comm_onboard_runner PRIVATE