diff --git a/docs/dfx/dep_gen.md b/docs/dfx/dep_gen.md index 1da946890..7198ebd32 100644 --- a/docs/dfx/dep_gen.md +++ b/docs/dfx/dep_gen.md @@ -358,8 +358,8 @@ list; only the dep_gen replay graph loses the tail. | Layer | File | Role | | ----- | ---- | ---- | | Shared-mem layout | `src/{a2a3,a5}/platform/include/common/dep_gen.h` | `DepGenRecord` (4672 B base, cache-line aligned, ≤64 inline explicit_deps, per-task `block_num`) + `DepGenOverflowRecord` chain view (≤582 deps per slot) + SPSC ring + per-thread ready queue. Byte-identical layout across platforms. | -| AICPU writer | `src/{a2a3,a5}/platform/{include,shared}/aicpu/dep_gen_collector_aicpu.{h,cpp}` | Single-instance write path; weak-fallback exported to host build. a5 reuses the a2a3 source verbatim — the writer accesses its own device-side view of shared memory, independent of how host↔device transport is implemented. | -| Host collector | `src/{a2a3,a5}/platform/{include/host,shared/host}/dep_gen_collector.{h,cpp}` | `ProfilerBase` — drains ring → `records_` vector. On a5 (no SVM) it uses the base `alloc_paired_buffer`, which malloc's a host shadow + `copy_to_device`'s it and registers it via `add_malloc_shadow` so teardown can free it; `reconcile_counters` explicitly `copy_from_device`'s the BufferState before reading, and `finalize` lets `BufferPoolManager::clear_mappings()` release all shadows as the single source of truth. | +| AICPU writer | `src/{a2a3,a5}/platform/include/aicpu/dep_gen_collector_aicpu.h`, `src/common/platform/shared/aicpu/dep_gen_collector_aicpu.cpp` | Single-instance write path; weak-fallback exported to host build. Both platforms share the same writer implementation — the writer accesses its own device-side view of shared memory, independent of how host↔device transport is implemented. | +| Host collector | `src/common/platform/include/host/dep_gen_collector.h`, `src/common/platform/shared/host/dep_gen_collector.cpp` | `ProfilerBase` — drains ring → `records_` vector. 
On non-SVM platforms it uses the base `alloc_paired_buffer`, which malloc's a host shadow + `copy_to_device`'s it and registers it via `add_malloc_shadow` so teardown can free it; `reconcile_counters` explicitly `copy_from_device`'s the BufferState before reading, and `finalize` lets `BufferPoolManager::clear_mappings()` release all shadows as the single source of truth. | | Capture call site | `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` `submit_task_common` | One conditional block that snapshots inputs into the ring when `is_dep_gen_enabled()`; fires for both `submit_task` and `submit_dummy_task`. The schema carries `kernel_ids[3] = {aic, aiv0, aiv1}` so the swimlane post-processor can resolve `task_id → kernel` from `deps.json` at level=1 where the AICore record is the sole device-side identity source. Inactive subslots stay at `INVALID_KERNEL_ID = -1`. It also carries the SPMD logical block num (`block_num` on a2a3, `core_num` on a5's launch spec) as `tasks[].block_num`. | | Replay | `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}` | Pure CPU; runs dual-pass differential replay — `compute_task_fanin` (oracle) + inlined STEP A/B mirror (annotated) against two `PTO2TensorMap` instances. Emits `deps.json` when both passes agree per record. Platform-agnostic — a5 reuses the a2a3 source verbatim. 
| | Device-runner hookup | `src/{a2a3,a5}/platform/{onboard,sim}/host/device_runner.cpp` | post-`reconcile_counters` calls `dep_gen_replay_emit_deps_json(records.data(), records.size(), deps_path)` | diff --git a/docs/dfx/l2-swimlane-profiling.md b/docs/dfx/l2-swimlane-profiling.md index d7a051854..241e8b984 100644 --- a/docs/dfx/l2-swimlane-profiling.md +++ b/docs/dfx/l2-swimlane-profiling.md @@ -708,7 +708,7 @@ export_swimlane_json() ← writes /l2_swimlane_record finalize(unregister, free) ``` -[`L2SwimlaneCollector`](../src/a2a3/platform/include/host/l2_swimlane_collector.h) +[`L2SwimlaneCollector`](../src/common/platform/include/host/l2_swimlane_collector.h) on a2a3 inherits from [`profiling_common::ProfilerBase`](../src/common/platform/include/host/profiler_base.h): the base class owns split mgmt threads, collector shards, and the @@ -838,7 +838,7 @@ l2_swimlane_collector_.export_swimlane_json() l2_swimlane_collector_.finalize() ``` -[`L2SwimlaneCollector`](../src/a5/platform/include/host/l2_swimlane_collector.h) +[`L2SwimlaneCollector`](../src/common/platform/include/host/l2_swimlane_collector.h) on a5 inherits the same CRTP base ([`profiling_common::ProfilerBase`](../src/common/platform/include/host/profiler_base.h)) as a2a3 and parameterizes diff --git a/docs/hardware/cache-coherency.md b/docs/hardware/cache-coherency.md index c0174dc40..ae5e8494c 100644 --- a/docs/hardware/cache-coherency.md +++ b/docs/hardware/cache-coherency.md @@ -99,7 +99,7 @@ Two separate concerns, often conflated: `rmb()` between the COND check and the slot reads. Concretely, the L2 swimlane staging-slot read in -`src/{a2a3,a5}/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp` does +`src/common/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp` does **not** call `cache_invalidate_range` on the slot, but it **does** call `rmb()` before reading `slot->task_id` and the timing fields. 
All of those fields are AICore writes covered by the AICore-side `dcci` in diff --git a/docs/profiling-framework.md b/docs/profiling-framework.md index b5920b25d..f5b30b916 100644 --- a/docs/profiling-framework.md +++ b/docs/profiling-framework.md @@ -177,8 +177,8 @@ the required members are: The Module structs are defined alongside their collectors in [pmu_collector.h](../src/a2a3/platform/include/host/pmu_collector.h), -[l2_swimlane_collector.h](../src/a2a3/platform/include/host/l2_swimlane_collector.h), -[dep_gen_collector.h](../src/a2a3/platform/include/host/dep_gen_collector.h), +[l2_swimlane_collector.h](../src/common/platform/include/host/l2_swimlane_collector.h), +[dep_gen_collector.h](../src/common/platform/include/host/dep_gen_collector.h), [tensor_dump_collector.h](../src/common/platform/include/host/tensor_dump_collector.h), and [scope_stats_collector.h](../src/common/platform/include/host/scope_stats_collector.h) @@ -336,13 +336,13 @@ Existing collectors are the canonical examples: - [`PmuCollector`](../src/a2a3/platform/include/host/pmu_collector.h) — single kind, per-core instances. See [pmu-profiling.md](dfx/pmu-profiling.md). -- [`DepGenCollector`](../src/a2a3/platform/include/host/dep_gen_collector.h) +- [`DepGenCollector`](../src/common/platform/include/host/dep_gen_collector.h) — single kind, one instance. See [dep_gen.md](dfx/dep_gen.md). - [`TensorDumpCollector`](../src/common/platform/include/host/tensor_dump_collector.h) — single kind, per-AICPU-thread instances. See [args-dump.md](dfx/args-dump.md). - [`ScopeStatsCollector`](../src/common/platform/include/host/scope_stats_collector.h) — single kind, one instance. See [scope-stats.md](dfx/scope-stats.md). 
-- [`L2SwimlaneCollector`](../src/a2a3/platform/include/host/l2_swimlane_collector.h) +- [`L2SwimlaneCollector`](../src/common/platform/include/host/l2_swimlane_collector.h) — four kinds (AICPU task, scheduler phase, orchestrator phase, AICore task), per-core / per-thread instances; the canonical multi-kind example. See [l2-swimlane-profiling.md](dfx/l2-swimlane-profiling.md). diff --git a/src/a2a3/platform/include/host/dep_gen_collector.h b/src/a2a3/platform/include/host/dep_gen_collector.h deleted file mode 100644 index e5f86a89d..000000000 --- a/src/a2a3/platform/include/host/dep_gen_collector.h +++ /dev/null @@ -1,294 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * @file dep_gen_collector.h - * @brief Host-side dep_gen (SubmitTrace) buffer allocation, streaming - * collection, and raw binary export. - * - * Architecture: - * - BufferPoolManager: shared mgmt-thread infrastructure that - * polls per-thread ready queues, drains done-queue shards, and replenishes - * the single instance's free_queue from a unified recycled pool. - * - DepGenCollector: collector thread shards pop full DepGenBuffers from the - * manager and append their DepGenRecords to a binary file - * (submit_trace.bin). 
- * - * Lifecycle: - * init() — Allocate header + 1 BufferState + N DepGenBuffers - * (pre-fills free_queue; surplus → recycled pool). - * Calls set_memory_context() on the base. - * start(tf) — Inherited: launches mgmt + collector threads. - * [device execution] - * stop() — Inherited: drain queues, join threads. - * reconcile_counters() — Sanity-check current_buf_ptr is cleared by - * AICPU flush, run collected+dropped==total - * cross-check. If dropped_record_count > 0, - * the host caller skips deps.json emission - * (incomplete graph; user gets a warning). - * finalize() — Free all device memory, unregister. - * - * Output format (submit_trace.bin): a fixed-size header followed by a - * contiguous stream of DepGenRecord values. Replay (future PR) reads this - * back. Layout intentionally trivial (no varint / framing) so the - * `sizeof(DepGenRecord)` ABI in `common/dep_gen.h` is the only contract. - */ - -#ifndef SRC_A2A3_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_ -#define SRC_A2A3_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_ - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "common/dep_gen.h" -#include "common/platform_config.h" -#include "common/unified_log.h" -#include "host/profiler_base.h" - -// --------------------------------------------------------------------------- -// dep_gen Module (drives BufferPoolManager) -// --------------------------------------------------------------------------- - -/** - * Internal hand-off struct delivered from a drain thread to a collector shard. - * thread_index identifies the AICPU thread queue the entry was popped from - * (always equal to the orchestrator thread index, since dep_gen is single- - * instance — exposed for symmetry with PmuReadyBufferInfo). 
- */ -struct DepGenReadyBufferInfo { - uint32_t instance_index; // Always 0 (single instance) - uint32_t thread_index; // AICPU thread queue index this entry came from - void *dev_buffer_ptr; - void *host_buffer_ptr; - uint32_t buffer_seq; -}; - -struct DepGenModule { - using DataHeader = DepGenDataHeader; - using ReadyEntry = DepGenReadyQueueEntry; - using ReadyBufferInfo = ::DepGenReadyBufferInfo; - using FreeQueue = DepGenFreeQueue; - - static constexpr int kBufferKinds = 1; - static constexpr uint32_t kReadyQueueSize = PLATFORM_DEP_GEN_READYQUEUE_SIZE; - static constexpr uint32_t kSlotCount = PLATFORM_DEP_GEN_SLOT_COUNT; - static constexpr const char *kSubsystemName = "DepGenModule"; - static constexpr int kMgmtDrainThreadCount = PLATFORM_MAX_AICPU_THREADS; - static constexpr int kCollectorThreadCount = PLATFORM_MAX_AICPU_THREADS; - - /** - * Buffers grown by proactive_replenish are batch-allocated up to the - * per-instance ceiling minus the slot count. - */ - static constexpr int batch_size(int /*kind*/) { - constexpr int kBatch = PLATFORM_DEP_GEN_BUFFERS_PER_INSTANCE - PLATFORM_DEP_GEN_SLOT_COUNT; - return kBatch < 1 ? 1 : kBatch; - } - - static DataHeader *header_from_shm(void *shm) { return get_dep_gen_header(shm); } - - /** - * `count` is intentionally NOT reset here — AICPU is the sole writer and - * resets it itself on flush/drop/pop. 
- */ - static std::optional> - resolve_entry(void *shm, DataHeader *header, int q, const ReadyEntry &entry) { - if (shm == nullptr || header == nullptr) { - LOG_ERROR("DepGenModule: invalid shared memory/header while resolving ready entry"); - return std::nullopt; - } - if (header->num_instances != 1 || entry.instance_index >= header->num_instances) { - LOG_ERROR( - "DepGenModule: invalid ready entry instance=%u (num_instances=%u)", entry.instance_index, - header->num_instances - ); - return std::nullopt; - } - DepGenBufferState *state = get_dep_gen_buffer_state(shm, static_cast(entry.instance_index)); - profiling_common::EntrySite site; - site.kind = 0; - site.free_queue = &state->free_queue; - site.buffer_size = sizeof(DepGenBuffer); - site.info.instance_index = entry.instance_index; - site.info.thread_index = static_cast(q); - site.info.dev_buffer_ptr = reinterpret_cast(entry.buffer_ptr); - site.info.host_buffer_ptr = nullptr; // filled by ProfilerAlgorithms - site.info.buffer_seq = entry.buffer_seq; - return site; - } - - template - static void for_each_instance(void *shm, DataHeader *header, Cb &&cb) { - const int n = static_cast(header->num_instances); - for (int i = 0; i < n; i++) { - DepGenBufferState *state = get_dep_gen_buffer_state(shm, i); - cb(/*kind=*/0, &state->free_queue, sizeof(DepGenBuffer)); - } - } -}; - -// --------------------------------------------------------------------------- -// Memory callbacks — thin aliases for the canonical profiling_common shapes. -// alloc / free are std::function so callers bind their MemoryAllocator via -// lambda capture; register / unregister stay as plain function pointers -// because they wrap stateless HAL globals (halHost*). 
-// --------------------------------------------------------------------------- - -using DepGenAllocCallback = profiling_common::ProfAllocCallback; -using DepGenRegisterCallback = profiling_common::ProfRegisterCallback; -using DepGenUnregisterCallback = profiling_common::ProfUnregisterCallback; -using DepGenFreeCallback = profiling_common::ProfFreeCallback; - -// --------------------------------------------------------------------------- -// DepGenCollector -// --------------------------------------------------------------------------- - -class DepGenCollector : public profiling_common::ProfilerBase { -public: - DepGenCollector() = default; - ~DepGenCollector(); - - DepGenCollector(const DepGenCollector &) = delete; - DepGenCollector &operator=(const DepGenCollector &) = delete; - - static constexpr int kIdleTimeoutSec = PLATFORM_DEP_GEN_TIMEOUT_SECONDS; - static constexpr const char *kSubsystemName = "DepGen"; - - /** - * Allocate dep_gen shared memory and pre-populate the free_queue. - * - * Allocates a DepGenDataHeader + 1 DepGenBufferState, plus - * PLATFORM_DEP_GEN_BUFFERS_PER_INSTANCE DepGenBuffers. The first - * PLATFORM_DEP_GEN_SLOT_COUNT buffers go directly into the free_queue; - * the surplus go into BufferPoolManager's shared recycled pool. - * - * @param num_threads Number of AICPU scheduling threads (so the - * DataHeader sizes its per-thread ready queues) - * @param submit_trace_path Output file path (.bin) - * @param alloc_cb Memory allocation callback - * @param register_cb halHostRegister callback (nullptr in sim) - * @param free_cb Memory free callback - * @param device_id Device ID - * @return 0 on success, non-zero on failure - */ - int init( - int num_threads, const DepGenAllocCallback &alloc_cb, DepGenRegisterCallback register_cb, - const DepGenFreeCallback &free_cb, int device_id - ); - - /** - * Device pointer to the DepGenDataHeader. 
Set kernel_args.dep_gen_data_base - * to this after init() so AICPU can find the shared memory via - * set_platform_dep_gen_base(). - */ - void *get_dep_gen_shm_device_ptr() const { return shm_dev_; } - - /** - * Per-buffer callback invoked by ProfilerBase's poll loop. Appends the - * buffer's DepGenRecord entries to the in-memory ``records_`` vector - * (no disk I/O — the host replay consumes that vector directly via - * ``records()`` once the device run completes). - */ - void on_buffer_collected(const DepGenReadyBufferInfo &info); - - /** - * After stop(): cross-check collected + dropped == total. If dropped > 0, - * the host caller skips deps.json emission so users get an incomplete- - * graph warning rather than partial data they might mistake for complete. - * - * @return true iff the run captured a complete trace (no drops, no leftovers). - */ - bool reconcile_counters(); - - /** - * Free all device memory and release the in-memory record buffer. Idempotent. - */ - void finalize(DepGenUnregisterCallback unregister_cb, const DepGenFreeCallback &free_cb); - - /** - * @return true if init() succeeded and finalize() has not run. - */ - bool is_initialized() const { return initialized_; } - - /** - * Total DepGenRecords drained from the device-side ring buffer so far. - */ - uint64_t total_collected() const { return total_collected_; } - - /** - * In-memory record buffer (host replay's input). Valid between init() - * and finalize(); pointer/size stay stable after stop() returns, which - * is when the caller hands them to ``dep_gen_replay_emit_deps_json``. - */ - const std::vector &records() const { return records_; } - -private: - bool initialized_ = false; - int num_threads_ = 0; - - // Shared memory region (DepGenDataHeader + DepGenBufferState[1]). - // shm_host_ / device_id_ live on ProfilerBase (set via set_memory_context - // in init()). 
- void *shm_dev_ = nullptr; - bool shm_registered_ = false; - size_t shm_size_ = 0; - - bool buffers_registered_ = false; - - // In-memory record buffer — drained from the device ring on - // on_buffer_collected() and consumed by the host replay directly (no - // disk hop). Mutex serializes the mgmt thread's appends against the - // (rare) reader on the same collector instance. - std::vector records_; - std::mutex records_mutex_; - - // Running total of records appended. Equal to ``records_.size()`` after - // every append; kept separately for the reconcile_counters cross-check - // even when records_ may be inspected concurrently. - uint64_t total_collected_ = 0; - - DepGenDataHeader *dep_gen_header() const { return get_dep_gen_header(shm_host_); } - DepGenBufferState *dep_gen_state(int idx = 0) const { return get_dep_gen_buffer_state(shm_host_, idx); } - - void append_buffer_records(const void *buf_host_ptr); -}; - -/** - * Build the ``deps.json`` output path under the caller-provided per-task - * directory. Filename is fixed (no timestamp) — the directory is the - * per-task uniqueness boundary, mirroring make_pmu_csv_path() and the now- - * removed make_dep_gen_path() for submit_trace.bin (deps.json is the only - * on-disk dep_gen artifact since the in-memory capture refactor). - */ -inline std::string make_deps_json_path(const std::string &output_dir) { - // Use std::filesystem::path's operator/ for join — robust against trailing - // slashes or path quirks that bare string concat would silently pass - // through. The sibling make_pmu_csv_path / make_l2_swimlane_path still use - // string concat; converting those is a follow-up cleanup since the - // project's output_prefix paths come from scene_test.py's pathlib join - // (never trailing-slashed in practice). 
- std::filesystem::path dir(output_dir); - std::error_code ec; - std::filesystem::create_directories(dir, ec); - if (ec) { - LOG_WARN("Failed to create dep_gen output directory %s: %s", output_dir.c_str(), ec.message().c_str()); - } - return (dir / "deps.json").string(); -} - -#endif // SRC_A2A3_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_ diff --git a/src/a2a3/platform/include/host/l2_swimlane_collector.h b/src/a2a3/platform/include/host/l2_swimlane_collector.h deleted file mode 100644 index b8bd2bb9b..000000000 --- a/src/a2a3/platform/include/host/l2_swimlane_collector.h +++ /dev/null @@ -1,499 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * @file l2_swimlane_collector.h - * @brief Platform-agnostic performance data collector with dynamic memory management. - * - * Architecture: - * - BufferPoolManager: shared mgmt-thread infrastructure that polls - * the AICPU ready queue, replenishes per-core / per-thread free queues, and - * hands full buffers off to collector thread shards. - * - L2SwimlaneCollector: collector thread shards copy records from manager ready queues - * into host vectors; the owner thread exports the swimlane visualization after stop(). - * - * Memory operations are injected through callbacks for sim/onboard portability. 
- */ - -#ifndef SRC_A2A3_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_ -#define SRC_A2A3_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_ - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "common/l2_swimlane_profiling.h" -#include "common/memory_barrier.h" -#include "common/platform_config.h" -#include "common/unified_log.h" -#include "host/profiler_base.h" - -// --------------------------------------------------------------------------- -// L2 Perf profiling Module (drives BufferPoolManager) -// --------------------------------------------------------------------------- - -/** - * L2 Perf has four distinct buffer kinds going through one ready queue per - * AICPU thread: - * - kind 0: per-core L2SwimlaneAicpuTaskBuffer (task records) - * - kind 1: per-thread L2SwimlaneAicpuSchedPhaseBuffer (scheduler phase records) - * - kind 2: per-thread L2SwimlaneAicpuOrchPhaseBuffer (orchestrator phase records) - * - kind 3: per-core L2SwimlaneAicoreTaskBuffer (AICore-written records) - * The ReadyQueueEntry::kind flag picks among them. - */ - -/** - * Buffer kind discriminator carried in ReadyBufferInfo and used to index the - * per-kind recycled pool inside BufferPoolManager. Values match - * L2SwimlaneBufferKind 1:1. - */ -enum class ProfBufferType { - AICPU_TASK = 0, - AICPU_SCHED_PHASE = 1, - AICPU_ORCH_PHASE = 2, - AICORE_TASK = 3, -}; - -/** - * Information about a ready (full) buffer, passed from mgmt thread to main thread. 
- */ -struct ReadyBufferInfo { - ProfBufferType type; - uint32_t index; // core_index (task) or thread_idx (phase) - uint32_t slot_idx; // Reserved (unused in free queue design) - void *dev_buffer_ptr; // Device address of the full buffer - void *host_buffer_ptr; // Host-mapped address (sim: same as dev) - uint32_t buffer_seq; // Sequence number for ordering -}; - -struct L2SwimlaneModule { - using DataHeader = L2SwimlaneDataHeader; - using ReadyEntry = ReadyQueueEntry; - using ReadyBufferInfo = ::ReadyBufferInfo; - using FreeQueue = L2SwimlaneFreeQueue; // all pool types share the same free_queue layout - - static constexpr int kBufferKinds = 4; - static constexpr uint32_t kReadyQueueSize = PLATFORM_PROF_READYQUEUE_SIZE; - static constexpr uint32_t kSlotCount = PLATFORM_PROF_SLOT_COUNT; - static constexpr const char *kSubsystemName = "L2SwimlaneModule"; - static constexpr int kMgmtDrainThreadCount = PLATFORM_MAX_AICPU_THREADS; - static constexpr int kCollectorThreadCount = PLATFORM_MAX_AICPU_THREADS; - - /** - * batch_size for proactive_replenish's alloc fallback. Sized so that a - * fully empty recycled pool refills to the configured per-instance - * ceiling in one tick. Sched and orch phase pools are sized independently - * (PLATFORM_PROF_{SCHED,ORCH}_BUFFERS_PER_THREAD). 
- */ - static constexpr int batch_size(int kind) { - constexpr int kPerfBatch = PLATFORM_PROF_BUFFERS_PER_CORE - PLATFORM_PROF_SLOT_COUNT; - constexpr int kSchedBatch = PLATFORM_PROF_SCHED_BUFFERS_PER_THREAD - PLATFORM_PROF_SLOT_COUNT; - constexpr int kOrchBatch = PLATFORM_PROF_ORCH_BUFFERS_PER_THREAD - PLATFORM_PROF_SLOT_COUNT; - constexpr int kAicoreBatch = PLATFORM_AICORE_BUFFERS_PER_CORE - PLATFORM_PROF_SLOT_COUNT; - int b = kPerfBatch; - switch (static_cast(kind)) { - case L2SwimlaneBufferKind::AicpuTask: - b = kPerfBatch; - break; - case L2SwimlaneBufferKind::AicpuSchedPhase: - b = kSchedBatch; - break; - case L2SwimlaneBufferKind::AicpuOrchPhase: - b = kOrchBatch; - break; - case L2SwimlaneBufferKind::AicoreTask: - b = kAicoreBatch; - break; - } - return b < 1 ? 1 : b; - } - - static int kind_of(const ReadyBufferInfo &info) { return static_cast(info.type); } - - static DataHeader *header_from_shm(void *shm) { return get_l2_swimlane_header(shm); } - - template - static void refresh_replenish_metadata(Mgr &mgr, DataHeader *header) { - mgr.read_range_from_device(&header->num_sched_phase_threads, sizeof(header->num_sched_phase_threads)); - mgr.read_range_from_device(&header->num_orch_phase_threads, sizeof(header->num_orch_phase_threads)); - rmb(); - } - - /** - * Branch on entry.kind to pick the per-core task state, per-thread sched- - * or orch-phase state, or per-core AICore state. Returns nullopt for - * out-of-range kind or core_index. - */ - static std::optional> - resolve_entry(void *shm, DataHeader *header, int /*q*/, const ReadyEntry &entry) { - const int num_cores = static_cast(header->num_cores); - const L2SwimlaneBufferKind kind = entry.kind; - - // Validate kind first — out-of-range silently falling into the wrong - // branch reads a wrong-typed pool. 
- if (kind != L2SwimlaneBufferKind::AicpuTask && kind != L2SwimlaneBufferKind::AicpuSchedPhase && - kind != L2SwimlaneBufferKind::AicpuOrchPhase && kind != L2SwimlaneBufferKind::AicoreTask) { - LOG_ERROR("L2SwimlaneModule: invalid entry kind=%u", static_cast(kind)); - return std::nullopt; - } - - // Sched/orch phase entries are indexed by thread_idx; task/aicore by core_index. - const bool is_phase = - (kind == L2SwimlaneBufferKind::AicpuSchedPhase) || (kind == L2SwimlaneBufferKind::AicpuOrchPhase); - if (is_phase) { - if (entry.core_index >= static_cast(PLATFORM_MAX_AICPU_THREADS)) { - LOG_ERROR("L2SwimlaneModule: invalid phase entry: thread=%u", entry.core_index); - return std::nullopt; - } - } else { - if (entry.core_index >= static_cast(num_cores)) { - LOG_ERROR( - "L2SwimlaneModule: invalid task entry: core=%u kind=%u", entry.core_index, - static_cast(kind) - ); - return std::nullopt; - } - } - - profiling_common::EntrySite site; - site.kind = static_cast(kind); - site.info.index = entry.core_index; - site.info.slot_idx = 0; - site.info.dev_buffer_ptr = reinterpret_cast(entry.buffer_ptr); - site.info.host_buffer_ptr = nullptr; // filled by ProfilerAlgorithms - site.info.buffer_seq = entry.buffer_seq; - - switch (kind) { - case L2SwimlaneBufferKind::AicpuTask: { - auto *state = get_perf_buffer_state(shm, static_cast(entry.core_index)); - site.free_queue = &state->free_queue; - site.buffer_size = sizeof(L2SwimlaneAicpuTaskBuffer); - site.info.type = ProfBufferType::AICPU_TASK; - break; - } - case L2SwimlaneBufferKind::AicpuSchedPhase: { - auto *state = get_sched_phase_buffer_state(shm, num_cores, static_cast(entry.core_index)); - site.free_queue = &state->free_queue; - site.buffer_size = sizeof(L2SwimlaneAicpuSchedPhaseBuffer); - site.info.type = ProfBufferType::AICPU_SCHED_PHASE; - break; - } - case L2SwimlaneBufferKind::AicpuOrchPhase: { - auto *state = get_orch_phase_buffer_state(shm, num_cores, static_cast(entry.core_index)); - site.free_queue = 
&state->free_queue; - site.buffer_size = sizeof(L2SwimlaneAicpuOrchPhaseBuffer); - site.info.type = ProfBufferType::AICPU_ORCH_PHASE; - break; - } - case L2SwimlaneBufferKind::AicoreTask: { - auto *ac_state = get_aicore_buffer_state(shm, num_cores, static_cast(entry.core_index)); - site.free_queue = &ac_state->free_queue; - site.buffer_size = sizeof(L2SwimlaneAicoreTaskBuffer); - site.info.type = ProfBufferType::AICORE_TASK; - break; - } - } - return site; - } - - template - static void for_each_instance(void *shm, DataHeader *header, Cb &&cb) { - const int num_cores = static_cast(header->num_cores); - - // AicpuTask: per-core (kind 0) - for (int i = 0; i < num_cores; i++) { - auto *state = get_perf_buffer_state(shm, i); - cb(/*kind=*/static_cast(L2SwimlaneBufferKind::AicpuTask), &state->free_queue, - sizeof(L2SwimlaneAicpuTaskBuffer)); - } - - // AicoreTask: per-core (kind 3) - for (int i = 0; i < num_cores; i++) { - auto *ac_state = get_aicore_buffer_state(shm, num_cores, i); - cb(/*kind=*/static_cast(L2SwimlaneBufferKind::AicoreTask), &ac_state->free_queue, - sizeof(L2SwimlaneAicoreTaskBuffer)); - } - - // AicpuSchedPhase: per-thread (kind 1) — gated on the header's - // sched-phase thread count (zero when phase init never ran). - // Bounds-clamp against PLATFORM_MAX_AICPU_THREADS so a corrupted - // device-shared value can't walk off the pool array. - int num_sched_phase_threads = static_cast(header->num_sched_phase_threads); - if (num_sched_phase_threads > PLATFORM_MAX_AICPU_THREADS) { - num_sched_phase_threads = 0; - } - for (int t = 0; t < num_sched_phase_threads; t++) { - auto *state = get_sched_phase_buffer_state(shm, num_cores, t); - cb(/*kind=*/static_cast(L2SwimlaneBufferKind::AicpuSchedPhase), &state->free_queue, - sizeof(L2SwimlaneAicpuSchedPhaseBuffer)); - } - - // AicpuOrchPhase: per-thread (kind 2) — same bounds clamp. 
- int num_orch_phase_threads = static_cast(header->num_orch_phase_threads); - if (num_orch_phase_threads > PLATFORM_MAX_AICPU_THREADS) { - num_orch_phase_threads = 0; - } - for (int t = 0; t < num_orch_phase_threads; t++) { - auto *state = get_orch_phase_buffer_state(shm, num_cores, t); - cb(/*kind=*/static_cast(L2SwimlaneBufferKind::AicpuOrchPhase), &state->free_queue, - sizeof(L2SwimlaneAicpuOrchPhaseBuffer)); - } - } -}; - -// Memory callbacks — thin aliases for the canonical profiling_common shapes. -// alloc / free are std::function so callers bind their MemoryAllocator via -// lambda capture; register / unregister stay as plain function pointers -// because they wrap stateless HAL globals (halHost*). -using L2SwimlaneAllocCallback = profiling_common::ProfAllocCallback; -using L2SwimlaneRegisterCallback = profiling_common::ProfRegisterCallback; -using L2SwimlaneUnregisterCallback = profiling_common::ProfUnregisterCallback; -using L2SwimlaneFreeCallback = profiling_common::ProfFreeCallback; - -// ============================================================================= -// L2SwimlaneCollector -// ============================================================================= - -/** - * Performance data collector. - * - * Lifecycle: - * 1. initialize() — allocate shared memory, pre-fill free_queues, - * hand the memory context to the base via - * set_memory_context(). - * 2. start(tf) — inherited from ProfilerBase; launches - * drain/refill, replenish, and collector threads. - * 3. ... device execution ... - * 4. stop() — joins drain/refill and replenish before - * letting collector threads exit. - * 5. read_phase_header_metadata() — single-shot read of the core→thread - * mapping from L2SwimlaneDataHeader. - * 6. reconcile_counters() — device-side three-bucket accounting for - * both PERF and PHASE pools (total / - * collected / dropped). - * 7. export_swimlane_json() / finalize(). 
- * - * Host never reads from device-side `current_buf_ptr` to recover records: - * device flush is the only data path. Any non-zero `current_buf_ptr` after - * stop() is logged as a bug. - */ -class L2SwimlaneCollector : public profiling_common::ProfilerBase { -public: - L2SwimlaneCollector() = default; - ~L2SwimlaneCollector(); - - L2SwimlaneCollector(const L2SwimlaneCollector &) = delete; - L2SwimlaneCollector &operator=(const L2SwimlaneCollector &) = delete; - - // ProfilerBase contract - static constexpr int kIdleTimeoutSec = PLATFORM_PROF_TIMEOUT_SECONDS; - static constexpr const char *kSubsystemName = "L2Swimlane"; - - /** - * Initialize performance profiling. - * - * Allocates the shared-memory region (header + per-core / per-thread - * BufferStates), pre-allocates initial L2SwimlaneAicpuTaskBuffers and PhaseBuffers, - * and seeds the per-pool free_queues + the framework's recycled pools. - * - * @param num_aicore Number of AICore instances - * @param device_id Device ID (forwarded to register_cb) - * @param l2_swimlane_level Collection granularity (DISABLED / AICORE_TIMING - * / AICPU_TIMING / SCHED_PHASES / ORCH_PHASES). - * Written into - * `L2SwimlaneDataHeader::l2_swimlane_level` - * so AICPU can promote it in - * `l2_swimlane_aicpu_init`, AND cached on the - * collector so `export_swimlane_json()` - * can gate phase sections and stamp the - * JSON `version`. - * @param alloc_cb Device memory allocation callback - * @param register_cb Memory registration callback (nullptr for - * simulation) - * @param free_cb Device memory free callback - * @param user_data Opaque pointer forwarded to callbacks - * @param output_prefix Per-task directory; l2_swimlane_records.json - * lands here. Required (non-empty); - * CallConfig::validate() enforces this - * upstream. 
- * @return 0 on success, error code on failure - */ - int initialize( - int num_aicore, int aicpu_thread_num, int device_id, L2SwimlaneLevel l2_swimlane_level, - const L2SwimlaneAllocCallback &alloc_cb, L2SwimlaneRegisterCallback register_cb, - const L2SwimlaneFreeCallback &free_cb, const std::string &output_prefix - ); - - /** - * Per-buffer callback invoked by ProfilerBase's collector loop. Dispatches on - * info.type to copy either an L2SwimlaneAicpuTaskBuffer (PERF_RECORD) into the per-core - * record vector, or a L2SwimlaneAicpuSchedPhaseBuffer / L2SwimlaneAicpuOrchPhaseBuffer into the per-thread - * phase-record vector. - */ - void on_buffer_collected(const ReadyBufferInfo &info); - - /** - * Publish per-core core_type (AIC/AIV/...) so the host emit path can - * resolve the lane label without consulting an AICPU task record. Required - * for AICORE_TIMING (level=1) where complete_task is bypassed and the - * AICore record alone is on disk. Caller is the device_runner — sim sets - * it from `runtime.workers[i].core_type` (rule-based), onboard sets it - * from the handshake-discovered table. - * - * Safe to call multiple times; the last call wins. - * - * @param types CoreType[n] table indexed by core_id - * @param n table length (typically `num_aicore`) - */ - void set_core_types(const CoreType *types, int n); - - /** - * Export collected records as a Chrome Trace Event JSON (swimlane view). - * Writes /l2_swimlane_records.json — directory is captured at - * initialize() time. - * - * @return 0 on success, error code on failure - */ - int export_swimlane_json(); - - /** - * Free all device memory and unregister mappings. Idempotent on a - * collector that was never initialized. 
- * - * @param unregister_cb Memory unregister callback (nullptr in sim mode) - * @param free_cb Memory free callback - * @param user_data Opaque pointer forwarded to callbacks - * @return 0 on success, error code on failure - */ - int finalize(L2SwimlaneUnregisterCallback unregister_cb, const L2SwimlaneFreeCallback &free_cb); - - /** - * @return true if initialize() succeeded and finalize() has not run. - */ - bool is_initialized() const { return shm_host_ != nullptr; } - - /** - * Device pointer to the L2SwimlaneDataHeader. Set kernel_args.l2_swimlane_data_base - * to this after initialize() succeeds so the AICPU side can find the - * shared memory. - */ - void *get_l2_swimlane_setup_device_ptr() const { return perf_shared_mem_dev_; } - - /** - * Device pointer to a uint64_t[num_aicore] table where each entry will - * hold this core's `&L2SwimlaneAicoreTaskPool::rotation` device address. Host - * only allocates the bytes here; AICPU populates the entries inside - * `l2_swimlane_aicpu_init`. Freed by finalize(). Set kernel_args.l2_swimlane_aicore_rotation_table - * to this so the AICore kernel entry can index by block_idx and feed the - * per-core rotation channel into `set_l2_swimlane_aicore_head_slot()`. Returns - * nullptr before initialize() succeeds. - */ - void *get_aicore_ring_addr_table_device_ptr() const { return aicore_ring_addr_table_dev_; } - - /** - * Read AICPU phase metadata that lives in L2SwimlaneDataHeader (not on the - * buffer pipeline): the core→thread mapping plus a has-data signal - * derived from accumulated per-event records. Single-shot — must be - * called after stop() so the shm region has settled. 
- */ - void read_phase_header_metadata(); - - /** - * Sum per-core / per-thread total_record_count and dropped_record_count - * for both the PERF and PHASE pools, cross-check - * `collected + dropped == device_total`, and LOG_ERROR any non-zero - * current_buf_ptr (which would indicate a device-side flush failure that - * left a buffer un-enqueued — see .claude/rules/discipline.md). - * The PHASE block is skipped silently when no phase activity was - * recorded (runtimes that don't emit phase records). Must be called - * after stop(). - */ - void reconcile_counters(); - - /** - * @return Per-core L2SwimlaneAicpuTaskRecord vectors (indexed by core_index). For tests. - */ - const std::vector> &get_records() const { return collected_perf_records_; } - -private: - // Shared memory pointers. shm_host_ / device_id_ live on ProfilerBase - // (set via set_memory_context in initialize()). - void *perf_shared_mem_dev_{nullptr}; - - // Standalone uint64_t[num_aicore] table holding per-core L2SwimlaneAicoreTaskBuffer - // addresses. Allocated in initialize(), freed in finalize(). AICore reads - // ring_table[block_idx] via KernelArgs::l2_swimlane_aicore_rotation_table. - void *aicore_ring_addr_table_dev_{nullptr}; - - int num_aicore_{0}; - // Total AICPU threads launched this run. The dedicated orchestrator runs on - // the last one (aicpu_thread_num_ - 1); used to report its thread number in - // the phase-metadata log (orch-phase is a single pool, so its index alone - // does not encode the AICPU thread). - int aicpu_thread_num_{0}; - L2SwimlaneLevel l2_swimlane_level_{L2SwimlaneLevel::DISABLED}; - - // Per-core core_type table populated by set_core_types(). Indexed by - // core_id; size matches num_aicore_ once populated. Used by the level=1 - // emit path which has no AICPU record to read core_type from. - std::vector core_types_; - - // Per-task output directory captured at initialize() time. Consumed by - // export_swimlane_json() to build /l2_swimlane_records.json. 
- std::string output_prefix_; - - // Collected data (per-core vectors, indexed by core_index) - std::vector> collected_perf_records_; - - // Collected AICore records (per-core vectors). Each entry is a full - // L2SwimlaneAicoreTaskRecord captured from a rotated L2SwimlaneAicoreTaskBuffer. The - // order across rotations is preserved by `copy_aicore_buffer` (we sort - // incoming buffers by buffer_seq before flattening). - std::vector> collected_aicore_records_; - - // AICPU phase profiling data — separate per-thread vectors for sched and - // orch records (kind-tagged at routing time; no parse-time discrimination). - std::vector> collected_sched_phase_records_; - std::vector> collected_orch_phase_records_; - std::atomic has_phase_data_{false}; - - // Core-to-thread mapping (core_id → scheduler thread index, -1 = unassigned) - std::vector core_to_thread_; - - // Running totals used at reconcile time to cross-check device-side counters. - std::atomic total_perf_collected_{0}; - std::atomic total_sched_phase_collected_{0}; - std::atomic total_orch_phase_collected_{0}; - - std::array perf_record_mutexes_; - std::array aicore_record_mutexes_; - std::array sched_phase_record_mutexes_; - std::array orch_phase_record_mutexes_; - - // Allocate a single buffer (any of the L2SwimlaneAicpu*Buffer kinds) and register it. - // The RAII counterpart ``release_one_buffer`` lives on ProfilerBase and - // is shared with every other collector. - void *alloc_single_buffer(size_t size, void **host_ptr_out); - - // Per-buffer-kind handlers used by on_buffer_collected. 
- void copy_perf_buffer(const ReadyBufferInfo &info); - void copy_sched_phase_buffer(const ReadyBufferInfo &info); - void copy_orch_phase_buffer(const ReadyBufferInfo &info); - void copy_aicore_buffer(const ReadyBufferInfo &info); -}; - -#endif // SRC_A2A3_PLATFORM_INCLUDE_HOST_L2_SWIMLANE_COLLECTOR_H_ diff --git a/src/a2a3/platform/onboard/host/CMakeLists.txt b/src/a2a3/platform/onboard/host/CMakeLists.txt index 84e6f3f3c..dc4871139 100644 --- a/src/a2a3/platform/onboard/host/CMakeLists.txt +++ b/src/a2a3/platform/onboard/host/CMakeLists.txt @@ -57,11 +57,11 @@ list(APPEND HOST_RUNTIME_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/pto_runtime_c_api.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/platform_compile_info.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/host_regs.cpp" - "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/l2_swimlane_collector.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/l2_swimlane_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/tensor_dump_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/comm_hccl.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/pmu_collector.cpp" - "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/dep_gen_collector.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/dep_gen_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/scope_stats_collector.cpp" ) # Add common/aicpu_loader/host sources (LoadAicpuOp) diff --git a/src/a2a3/platform/shared/host/dep_gen_collector.cpp b/src/a2a3/platform/shared/host/dep_gen_collector.cpp deleted file mode 100644 index af549ccf8..000000000 --- a/src/a2a3/platform/shared/host/dep_gen_collector.cpp +++ /dev/null @@ -1,275 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). 
- * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * @file dep_gen_collector.cpp - * @brief Host-side dep_gen collector. The mgmt-thread + buffer-pool machinery - * lives in profiling_common::BufferPoolManager parameterized by - * DepGenModule (host/dep_gen_collector.h); this file owns the - * per-buffer on_buffer_collected callback (in-memory append) and the - * device-side cross-check. Records stay in ``records_`` and are - * consumed directly by the host replay — no on-disk submit_trace.bin - * intermediary. 
- */ - -#include "host/dep_gen_collector.h" - -#include -#include -#include - -#include "common/memory_barrier.h" -#include "common/unified_log.h" - -DepGenCollector::~DepGenCollector() { stop(); } - -// --------------------------------------------------------------------------- -// init -// --------------------------------------------------------------------------- - -int DepGenCollector::init( - int num_threads, const DepGenAllocCallback &alloc_cb, DepGenRegisterCallback register_cb, - const DepGenFreeCallback &free_cb, int device_id -) { - if (num_threads <= 0 || alloc_cb == nullptr || free_cb == nullptr) { - LOG_ERROR("DepGenCollector::init: invalid arguments"); - return -1; - } - - num_threads_ = num_threads; - buffers_registered_ = (register_cb != nullptr); - total_collected_ = 0; - records_.clear(); - execution_complete_.store(false, std::memory_order_release); - - // ---- Allocate shared header + buffer-state region ---- - // dep_gen is single-instance: just one DepGenBufferState after the header. 
- const int num_instances = 1; - shm_size_ = calc_dep_gen_shm_size(num_instances); - shm_dev_ = alloc_cb(shm_size_); - if (shm_dev_ == nullptr) { - LOG_ERROR("DepGenCollector: failed to allocate dep_gen shared memory (%zu bytes)", shm_size_); - return -1; - } - - if (register_cb != nullptr) { - int rc = register_cb(shm_dev_, shm_size_, device_id, &shm_host_); - if (rc != 0) { - LOG_ERROR("DepGenCollector: halHostRegister for dep_gen SHM failed: %d", rc); - free_cb(shm_dev_); - shm_dev_ = nullptr; - return rc; - } - shm_registered_ = true; - } else { - shm_host_ = shm_dev_; - } - std::memset(shm_host_, 0, shm_size_); - - DepGenDataHeader *hdr = get_dep_gen_header(shm_host_); - hdr->num_instances = static_cast(num_instances); - - // ---- Allocate DepGenBuffers, populate free_queue + recycled pool ---- - const size_t buf_size = sizeof(DepGenBuffer); - DepGenBufferState *state = dep_gen_state(0); - - for (int b = 0; b < PLATFORM_DEP_GEN_BUFFERS_PER_INSTANCE; b++) { - void *dev_ptr = alloc_cb(buf_size); - if (dev_ptr == nullptr) { - LOG_ERROR("DepGenCollector: failed to allocate DepGenBuffer b=%d", b); - return -1; - } - - void *host_ptr = dev_ptr; - if (register_cb != nullptr) { - int rc = register_cb(dev_ptr, buf_size, device_id, &host_ptr); - if (rc != 0) { - LOG_ERROR("DepGenCollector: halHostRegister for DepGenBuffer b=%d failed: %d", b, rc); - free_cb(dev_ptr); - return rc; - } - } - std::memset(host_ptr, 0, buf_size); - - manager_.register_mapping(dev_ptr, host_ptr); - - if (b < PLATFORM_DEP_GEN_SLOT_COUNT) { - uint32_t tail = state->free_queue.tail; - assert(tail - state->free_queue.head < PLATFORM_DEP_GEN_SLOT_COUNT && "free_queue overflow on init"); - state->free_queue.buffer_ptrs[tail % PLATFORM_DEP_GEN_SLOT_COUNT] = reinterpret_cast(dev_ptr); - wmb(); - state->free_queue.tail = tail + 1; - wmb(); - } else { - manager_.push_recycled(0, dev_ptr); - } - } - - initialized_ = true; - set_memory_context( - alloc_cb, register_cb, free_cb, /*copy_to=*/nullptr, 
/*copy_from=*/nullptr, shm_dev_, shm_host_, shm_size_, - device_id - ); - - LOG_INFO_V0( - "DepGen collector initialized: %d threads, SHM=0x%lx (records held in memory until replay)", num_threads, - reinterpret_cast(shm_dev_) - ); - return 0; -} - -// --------------------------------------------------------------------------- -// Record accumulation (in-memory — no disk hop) -// --------------------------------------------------------------------------- - -void DepGenCollector::append_buffer_records(const void *buf_host_ptr) { - const DepGenBuffer *buf = reinterpret_cast(buf_host_ptr); - uint32_t n = buf->count; - if (n > static_cast(PLATFORM_DEP_GEN_RECORDS_PER_BUFFER)) { - n = static_cast(PLATFORM_DEP_GEN_RECORDS_PER_BUFFER); - } - if (n == 0) return; - - std::scoped_lock lock(records_mutex_); - records_.insert(records_.end(), buf->records, buf->records + n); - total_collected_ += n; -} - -// --------------------------------------------------------------------------- -// ProfilerBase callback -// --------------------------------------------------------------------------- - -void DepGenCollector::on_buffer_collected(const DepGenReadyBufferInfo &info) { - append_buffer_records(info.host_buffer_ptr); -} - -// --------------------------------------------------------------------------- -// reconcile_counters -// --------------------------------------------------------------------------- - -bool DepGenCollector::reconcile_counters() { - if (shm_host_ == nullptr) return false; - - rmb(); - - bool clean = true; - - DepGenBufferState *state = dep_gen_state(0); - uint64_t buf_dev = state->current_buf_ptr; - if (buf_dev != 0) { - void *host_ptr = manager_.resolve_host_ptr(reinterpret_cast(buf_dev)); - if (host_ptr != nullptr) { - uint32_t count = reinterpret_cast(host_ptr)->count; - if (count != 0) { - LOG_ERROR( - "dep_gen reconcile: un-flushed buffer (current_buf_ptr=0x%lx, count=%u) — device flush failed", - static_cast(buf_dev), count - ); - clean = false; - } - } - } - 
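The counter cross-check in `DepGenCollector::reconcile_counters` reduces to one balance equation: `collected + dropped` should equal `device_total + overflow`. A minimal sketch of that rule follows, assuming a hypothetical `Counters` snapshot struct and helper names (`silent_loss`, `is_clean`) that are not part of this PR. It deliberately ignores the separate un-flushed-buffer check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the counters the host inspects after stop().
struct Counters {
    uint64_t collected;  // records the host copied out of flushed buffers
    uint64_t dropped;    // records the device discarded
    uint64_t total;      // submits counted on the device
    uint64_t overflow;   // extra slots consumed by overflow chains
};

// Signed gap between what the device produced and what is accounted for.
// Zero means the books balance; a positive value is silent loss.
inline int64_t silent_loss(const Counters &c) {
    return static_cast<int64_t>(c.total + c.overflow) -
           static_cast<int64_t>(c.collected + c.dropped);
}

// Mirrors the rule in the diff: any drop or any imbalance marks the run unclean.
inline bool is_clean(const Counters &c) {
    assert(c.collected + c.dropped >= c.collected);  // guard against wrap-around
    return c.dropped == 0 && silent_loss(c) == 0;
}
```

Note that a run can balance and still be unclean: if one record was dropped, the equation holds but the dependency graph is incomplete, which is why the diff refuses to emit `deps.json` in that case.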
- uint64_t total_device = state->total_record_count; - uint64_t dropped_device = state->dropped_record_count; - uint64_t overflow_device = state->total_overflow_record_count; - - if (dropped_device > 0) { - LOG_WARN( - "dep_gen reconcile: %lu records dropped on device side (free_queue empty or ready_queue full). " - "Increase PLATFORM_DEP_GEN_BUFFERS_PER_INSTANCE / PLATFORM_DEP_GEN_READYQUEUE_SIZE if frequent. " - "deps.json will NOT be emitted for this run (incomplete graph).", - static_cast(dropped_device) - ); - clean = false; - } - // collected counts physical buffer slots; total_device counts submits; the - // chain expands submits into multiple slots, so the overflow counter - // bridges the two. - if (total_collected_ + dropped_device != total_device + overflow_device) { - LOG_WARN( - "dep_gen reconcile: record count mismatch (collected=%lu + dropped=%lu != device_total=%lu + " - "overflow=%lu, silent_loss=%ld)", - static_cast(total_collected_), static_cast(dropped_device), - static_cast(total_device), static_cast(overflow_device), - static_cast(total_device + overflow_device) - static_cast(total_collected_ + dropped_device) - ); - clean = false; - } else { - LOG_INFO_V0( - "dep_gen reconcile: counts match (collected=%lu, dropped=%lu, device_total=%lu, overflow=%lu)", - static_cast(total_collected_), static_cast(dropped_device), - static_cast(total_device), static_cast(overflow_device) - ); - } - - return clean; -} - -// --------------------------------------------------------------------------- -// finalize -// --------------------------------------------------------------------------- - -void DepGenCollector::finalize(DepGenUnregisterCallback unregister_cb, const DepGenFreeCallback &free_cb) { - if (!initialized_) return; - - stop(); - - { - std::scoped_lock lock(records_mutex_); - records_.clear(); - records_.shrink_to_fit(); - } - - // Same pattern as PmuCollector: walk owned buffers, then the free_queue - // and current_buf_ptr, releasing each unique 
device pointer once. - auto release_buf = [&](void *p) { - release_one_buffer(p, buffers_registered_ ? unregister_cb : nullptr, free_cb); - }; - manager_.release_owned_buffers(release_buf); - - if (shm_host_ != nullptr) { - std::unordered_set already_freed; - auto release_unique = [&](void *p) { - if (p == nullptr || !already_freed.insert(p).second) return; - release_buf(p); - }; - DepGenBufferState *state = dep_gen_state(0); - release_unique(reinterpret_cast(state->current_buf_ptr)); - state->current_buf_ptr = 0; - rmb(); - uint32_t head = state->free_queue.head; - uint32_t tail = state->free_queue.tail; - uint32_t queued = tail - head; - if (queued > PLATFORM_DEP_GEN_SLOT_COUNT) queued = PLATFORM_DEP_GEN_SLOT_COUNT; - for (uint32_t i = 0; i < queued; i++) { - uint32_t slot = (head + i) % PLATFORM_DEP_GEN_SLOT_COUNT; - release_unique(reinterpret_cast(state->free_queue.buffer_ptrs[slot])); - state->free_queue.buffer_ptrs[slot] = 0; - } - state->free_queue.head = tail; - } - manager_.clear_mappings(); - - if (shm_dev_ != nullptr) { - release_one_buffer(shm_dev_, shm_registered_ ? unregister_cb : nullptr, free_cb); - shm_dev_ = nullptr; - shm_host_ = nullptr; - } - - initialized_ = false; - buffers_registered_ = false; - shm_registered_ = false; - shm_size_ = 0; - total_collected_ = 0; - clear_memory_context(); - LOG_INFO_V0("DepGen collector finalized"); -} diff --git a/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp b/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp deleted file mode 100644 index 6572a498b..000000000 --- a/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp +++ /dev/null @@ -1,1036 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. 
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * @file l2_swimlane_collector.cpp - * @brief Performance data collector implementation. The mgmt-thread + buffer-pool - * machinery lives in profiling_common::BufferPoolManager parameterized by - * L2SwimlaneModule (host/l2_swimlane_collector.h); the poll loop lives in - * profiling_common::ProfilerBase. This file owns the per-buffer - * on_buffer_collected callback and the export logic. - */ - -#include "host/l2_swimlane_collector.h" - -#include -#include -#include -#include -#include -#include -#include - -#include "common/memory_barrier.h" -#include "common/unified_log.h" - -// ============================================================================= -// L2SwimlaneCollector Implementation -// ============================================================================= - -// Sched / orch phase records route through separate BufferKinds; no -// parse-time discriminator function is needed (the device-side type tag is -// the source of truth). 
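Routing by buffer kind, as the comment above describes, means the collector never inspects record contents to decide where they belong. The sketch below illustrates the idea with hypothetical names (`Kind`, `ReadyBuffer`, `Sink`) and an `int` payload standing in for the real record types. It is not the collector's actual `on_buffer_collected` implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical kind tag carried by every ready buffer (set on the producer side).
enum class Kind : uint32_t { AicpuTask = 0, AicoreTask, SchedPhase, OrchPhase, Count };

struct ReadyBuffer {
    Kind kind;
    std::vector<int> records;  // stand-in payload
};

struct Sink {
    // One destination vector per kind; the tag alone selects the destination.
    std::array<std::vector<int>, static_cast<std::size_t>(Kind::Count)> per_kind;

    // Appends the buffer's records to the vector for its kind.
    // Returns false when the tag is out of range instead of guessing.
    bool collect(const ReadyBuffer &b) {
        const auto k = static_cast<std::size_t>(b.kind);
        if (k >= per_kind.size()) return false;
        per_kind[k].insert(per_kind[k].end(), b.records.begin(), b.records.end());
        return true;
    }
};
```

Because the tag is trusted, sched and orch records can use different layouts without a parse-time discriminator.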
- -L2SwimlaneCollector::~L2SwimlaneCollector() { - stop(); - if (shm_host_ != nullptr) { - LOG_WARN("L2SwimlaneCollector destroyed without finalize()"); - } -} - -void *L2SwimlaneCollector::alloc_single_buffer(size_t size, void **host_ptr_out) { - void *dev_ptr = alloc_cb_(size); - if (dev_ptr == nullptr) { - LOG_ERROR("Failed to allocate buffer (%zu bytes)", size); - *host_ptr_out = nullptr; - return nullptr; - } - - if (register_cb_ != nullptr) { - void *host_ptr = nullptr; - int rc = register_cb_(dev_ptr, size, device_id_, &host_ptr); - if (rc != 0 || host_ptr == nullptr) { - LOG_ERROR("Buffer registration failed: %d", rc); - *host_ptr_out = nullptr; - return nullptr; - } - *host_ptr_out = host_ptr; - } else { - *host_ptr_out = dev_ptr; - } - - // Register mapping so the BufferPoolManager can resolve dev→host - manager_.register_mapping(dev_ptr, *host_ptr_out); - return dev_ptr; -} - -int L2SwimlaneCollector::initialize( - int num_aicore, int aicpu_thread_num, int device_id, L2SwimlaneLevel l2_swimlane_level, - const L2SwimlaneAllocCallback &alloc_cb, L2SwimlaneRegisterCallback register_cb, - const L2SwimlaneFreeCallback &free_cb, const std::string &output_prefix -) { - if (shm_host_ != nullptr) { - LOG_ERROR("L2SwimlaneCollector already initialized"); - return -1; - } - - LOG_INFO_V0("Initializing performance profiling"); - - if (num_aicore <= 0 || num_aicore > PLATFORM_MAX_CORES) { - LOG_ERROR("Invalid number of AICores: %d (max=%d)", num_aicore, PLATFORM_MAX_CORES); - return -1; - } - - num_aicore_ = num_aicore; - aicpu_thread_num_ = aicpu_thread_num; - l2_swimlane_level_ = l2_swimlane_level; - output_prefix_ = output_prefix; - total_perf_collected_.store(0, std::memory_order_relaxed); - total_sched_phase_collected_.store(0, std::memory_order_relaxed); - total_orch_phase_collected_.store(0, std::memory_order_relaxed); - - // Stash the memory context on the base up-front so alloc_single_buffer - // sees consistent values during init. 
shm_host_ stays nullptr until the - // shm allocation succeeds — the nullptr guard makes a post-failure - // start(tf) a no-op. - set_memory_context( - alloc_cb, register_cb, free_cb, /*copy_to=*/nullptr, /*copy_from=*/nullptr, /*shm_dev=*/nullptr, - /*shm_host=*/nullptr, /*shm_size=*/0, device_id - ); - - // Step 1: Calculate shared memory size (slot arrays only, no actual - // buffers). Host over-allocates phase pool slots to the platform max for - // both sched and orch — AICPU picks the actual counts at init_phase time - // and writes them into the header. - int num_phase_threads = PLATFORM_MAX_AICPU_THREADS; - size_t total_size = calc_perf_data_size_with_phases(num_aicore, num_phase_threads, num_phase_threads); - - LOG_DEBUG("Shared memory allocation plan:"); - LOG_DEBUG(" Number of cores: %d", num_aicore); - LOG_DEBUG(" Header size: %zu bytes", sizeof(L2SwimlaneDataHeader)); - LOG_DEBUG(" L2SwimlaneAicpuTaskPool size: %zu bytes each", sizeof(L2SwimlaneAicpuTaskPool)); - LOG_DEBUG(" L2SwimlaneAicpuSchedPhasePool size: %zu bytes each", sizeof(L2SwimlaneAicpuSchedPhasePool)); - LOG_DEBUG(" L2SwimlaneAicpuOrchPhasePool size: %zu bytes each", sizeof(L2SwimlaneAicpuOrchPhasePool)); - LOG_DEBUG(" Total shared memory: %zu bytes (%zu KB)", total_size, total_size / 1024); - - // Step 2: Allocate shared memory for slot arrays - void *perf_dev_ptr = alloc_cb(total_size); - if (perf_dev_ptr == nullptr) { - LOG_ERROR("Failed to allocate shared memory (%zu bytes)", total_size); - return -1; - } - LOG_DEBUG("Allocated shared memory: %p", perf_dev_ptr); - - // Step 3: Register to host mapping (optional) - void *perf_host_ptr = nullptr; - if (register_cb != nullptr) { - int rc = register_cb(perf_dev_ptr, total_size, device_id, &perf_host_ptr); - if (rc != 0) { - LOG_ERROR("Memory registration failed: %d", rc); - return rc; - } - if (perf_host_ptr == nullptr) { - LOG_ERROR("register_cb succeeded but returned null host_ptr"); - return -1; - } - LOG_DEBUG("Mapped to host memory: 
%p", perf_host_ptr); - } else { - perf_host_ptr = perf_dev_ptr; - LOG_DEBUG("Simulation mode: host_ptr = dev_ptr = %p", perf_host_ptr); - } - - // Step 4: Initialize header - L2SwimlaneDataHeader *header = get_l2_swimlane_header(perf_host_ptr); - - for (int t = 0; t < PLATFORM_MAX_AICPU_THREADS; t++) { - memset(header->queues[t], 0, sizeof(header->queues[t])); - header->queue_heads[t] = 0; - header->queue_tails[t] = 0; - } - - header->num_cores = num_aicore; - header->l2_swimlane_level = static_cast(l2_swimlane_level_); - // Phase metadata: must be zero-initialized here. alloc_cb returns - // uninitialized device memory; AICPU only writes these fields when - // phase init runs (level >= SCHED_PHASES). Without zeroing, lower - // levels (AICORE_TIMING / AICPU_TIMING) leave garbage that - // for_each_instance iterates as `num_sched_phase_threads` / - // `num_orch_phase_threads`, walking off the end of the allocated pool - // array → segfault. The host-side reader (read_phase_header_metadata) - // and BufferPoolManager replenish loop both gate on these counts being - // sane values. - header->num_sched_phase_threads = 0; - header->num_orch_phase_threads = 0; - header->num_phase_cores = 0; - memset(header->core_to_thread, -1, sizeof(header->core_to_thread)); - - LOG_DEBUG("Initialized L2SwimlaneDataHeader:"); - LOG_DEBUG(" num_cores: %d", header->num_cores); - LOG_DEBUG(" l2_swimlane_level: %u", header->l2_swimlane_level); - LOG_DEBUG(" buffer_capacity: %d", PLATFORM_PROF_BUFFER_SIZE); - LOG_DEBUG(" queue capacity: %d", PLATFORM_PROF_READYQUEUE_SIZE); - - // Step 5: Initialize L2SwimlaneAicpuTaskPools. Seed as many buffers as - // the device-side free_queue can hold; any remaining buffers stay in the - // host recycled pool. 
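The seeding rule in Step 5 can be sketched in isolation: the first `min(buffers, slots)` buffers go to the device-visible free queue, the remainder to a host-side recycled pool, and the tail is published last. `FreeQueue`, `seed_pool`, and `kSlotCount` are hypothetical stand-ins for this illustration. Device pointers are modelled as plain integers and the `wmb()` barriers are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the platform's slot-count constant.
constexpr uint32_t kSlotCount = 4;

// Simplified SPSC free queue: head and tail are monotonically increasing counters.
struct FreeQueue {
    uint64_t buffer_ptrs[kSlotCount] = {};
    uint32_t head = 0;
    uint32_t tail = 0;
};

// Seeds up to kSlotCount buffers into the queue and parks the rest in `recycled`.
// Returns how many buffers landed in the free queue.
inline uint32_t seed_pool(FreeQueue &q, const std::vector<uint64_t> &buffers,
                          std::vector<uint64_t> &recycled) {
    assert(q.head == 0 && q.tail == 0);  // only valid on a freshly zeroed queue
    const uint32_t initial_free =
        std::min<uint32_t>(static_cast<uint32_t>(buffers.size()), kSlotCount);
    for (std::size_t s = 0; s < buffers.size(); s++) {
        if (s < initial_free) {
            q.buffer_ptrs[s] = buffers[s];
        } else {
            recycled.push_back(buffers[s]);
        }
    }
    // Publish the tail only after the slots are filled, so a consumer never
    // observes an unfilled slot. The real code brackets this store with wmb().
    q.tail = initial_free;
    return initial_free;
}
```

With six buffers and four slots, four are immediately available to the device and two wait in the recycled pool until a slot frees up.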
- for (int i = 0; i < num_aicore; i++) { - L2SwimlaneAicpuTaskPool *state = get_perf_buffer_state(perf_host_ptr, i); - memset(state, 0, sizeof(L2SwimlaneAicpuTaskPool)); - - state->free_queue.head = 0; - state->free_queue.tail = 0; - state->head.current_buf_ptr = 0; - state->head.current_buf_seq = 0; - - const int initial_free_count = (PLATFORM_PROF_BUFFERS_PER_CORE < PLATFORM_PROF_SLOT_COUNT) ? - PLATFORM_PROF_BUFFERS_PER_CORE : - PLATFORM_PROF_SLOT_COUNT; - for (int s = 0; s < PLATFORM_PROF_BUFFERS_PER_CORE; s++) { - void *host_buf_ptr = nullptr; - void *dev_buf_ptr = alloc_single_buffer(sizeof(L2SwimlaneAicpuTaskBuffer), &host_buf_ptr); - if (dev_buf_ptr == nullptr) { - LOG_ERROR("Failed to allocate L2SwimlaneAicpuTaskBuffer for core %d, buffer %d", i, s); - return -1; - } - L2SwimlaneAicpuTaskBuffer *buf = reinterpret_cast(host_buf_ptr); - memset(buf, 0, sizeof(L2SwimlaneAicpuTaskBuffer)); - buf->count = 0; - - if (s < initial_free_count) { - state->free_queue.buffer_ptrs[s] = reinterpret_cast(dev_buf_ptr); - } else { - manager_.push_recycled(static_cast(ProfBufferType::AICPU_TASK), dev_buf_ptr); - } - } - wmb(); - state->free_queue.tail = static_cast(initial_free_count); - wmb(); - } - - // Step 5b: Initialize L2SwimlaneAicoreTaskPools — per-core AICore rotation - // channel + buffer pool. Same SPSC pattern as the AICPU pool above. - for (int i = 0; i < num_aicore; i++) { - L2SwimlaneAicoreTaskPool *ac_state = get_aicore_buffer_state(perf_host_ptr, num_aicore, i); - memset(ac_state, 0, sizeof(L2SwimlaneAicoreTaskPool)); - - const int initial_free_count = (PLATFORM_AICORE_BUFFERS_PER_CORE < PLATFORM_PROF_SLOT_COUNT) ? 
- PLATFORM_AICORE_BUFFERS_PER_CORE : - PLATFORM_PROF_SLOT_COUNT; - for (int s = 0; s < PLATFORM_AICORE_BUFFERS_PER_CORE; s++) { - void *host_buf_ptr = nullptr; - void *dev_buf_ptr = alloc_single_buffer(sizeof(L2SwimlaneAicoreTaskBuffer), &host_buf_ptr); - if (dev_buf_ptr == nullptr) { - LOG_ERROR("Failed to allocate L2SwimlaneAicoreTaskBuffer for core %d, buffer %d", i, s); - return -1; - } - L2SwimlaneAicoreTaskBuffer *buf = reinterpret_cast(host_buf_ptr); - memset(buf, 0, sizeof(L2SwimlaneAicoreTaskBuffer)); - buf->count = 0; - - if (s < initial_free_count) { - ac_state->free_queue.buffer_ptrs[s] = reinterpret_cast(dev_buf_ptr); - } else { - manager_.push_recycled(static_cast(ProfBufferType::AICORE_TASK), dev_buf_ptr); - } - } - wmb(); - ac_state->free_queue.tail = static_cast(initial_free_count); - wmb(); - } - LOG_DEBUG( - "Initialized buffer pools: %d L2SwimlaneAicpuTaskBuffers/core + %d L2SwimlaneAicoreTaskBuffers/core (up to " - "%d in free_queue, rest in recycled pool)", - PLATFORM_PROF_BUFFERS_PER_CORE, PLATFORM_AICORE_BUFFERS_PER_CORE, PLATFORM_PROF_SLOT_COUNT - ); - - // Step 5c: Standalone uint64_t[num_aicore] table that will hold per-core - // L2SwimlaneActiveHead device addresses. Host only allocates the bytes and - // hands the device pointer to AICPU via KernelArgs::l2_swimlane_aicore_rotation_table; - // AICPU itself fills the entries inside `l2_swimlane_aicpu_init` (it has - // direct access to `&ac_state->head` device addresses, no - // host-to-device translation needed). AICore reads - // rotation_table[block_idx] at kernel entry. 
- { - size_t table_bytes = static_cast(num_aicore) * sizeof(uint64_t); - void *rotation_table_host = nullptr; - void *rotation_table_dev = alloc_single_buffer(table_bytes, &rotation_table_host); - if (rotation_table_dev == nullptr) { - LOG_ERROR("Failed to allocate l2_swimlane_aicore_rotation_table (rotation) table (%zu bytes)", table_bytes); - return -1; - } - aicore_ring_addr_table_dev_ = rotation_table_dev; - } - - // Step 6: Initialize per-thread phase pools — both sched and orch. Each - // pool is sized to its own PLATFORM_PROF_{SCHED,ORCH}_BUFFERS_PER_THREAD - // (seeded into free_queue up to slot capacity, rest in the recycled pool - // tagged by kind). Templated on the concrete TypedBuffer so the `count` - // zero-store uses the matching layout — sched and orch buffers have - // DIFFERENT sizes (64B vs 32B records), - // so a single cast type for both would land the count store past the end - // of the orch allocation and corrupt the heap. - // state_count pool states are zeroed (so the host's [0, PLATFORM_MAX) - // reconcile/iteration reads count=0 for unused slots); buffers are - // allocated only for the first buffer_count pools. For sched the two are - // equal; orch is a single instance (pool 0), so it zeroes all slots but - // allocates buffers for just pool 0 — no buffers wasted on unused slots. - auto init_phase_pools = [&](auto buffer_tag, L2SwimlaneAicpuTaskPool *(*get_state)(void *, int, int), - int state_count, int buffer_count, int buffers_per_thread, ProfBufferType recycle_kind, - const char *kind_label) -> int { - using Buffer = typename decltype(buffer_tag)::type; - constexpr size_t buffer_bytes = sizeof(Buffer); - for (int t = 0; t < state_count; t++) { - auto *state = get_state(perf_host_ptr, num_aicore, t); - memset(state, 0, sizeof(L2SwimlaneAicpuTaskPool)); - if (t >= buffer_count) continue; // zeroed state only; no buffers (unused slot) - const int initial_free_count = - (buffers_per_thread < PLATFORM_PROF_SLOT_COUNT) ? 
buffers_per_thread : PLATFORM_PROF_SLOT_COUNT; - for (int s = 0; s < buffers_per_thread; s++) { - void *host_buf_ptr = nullptr; - void *dev_buf_ptr = alloc_single_buffer(buffer_bytes, &host_buf_ptr); - if (dev_buf_ptr == nullptr) { - LOG_ERROR("Failed to allocate %s phase buffer for thread %d, slot %d", kind_label, t, s); - return -1; - } - // Zero only the `count` word at the buffer's tail, using the - // matching Buffer type. The records payload is overwritten by - // AICPU on first use. - reinterpret_cast(host_buf_ptr)->count = 0; - if (s < initial_free_count) { - state->free_queue.buffer_ptrs[s] = reinterpret_cast(dev_buf_ptr); - } else { - manager_.push_recycled(static_cast(recycle_kind), dev_buf_ptr); - } - } - wmb(); - state->free_queue.tail = static_cast(initial_free_count); - wmb(); - } - return 0; - }; - - // Type tags so the templated lambda can deduce the buffer type without - // having to spell out an explicit template argument (not portable on a - // generic lambda before C++20 explicit template-parameter syntax). - struct SchedTag { - using type = L2SwimlaneAicpuSchedPhaseBuffer; - }; - struct OrchTag { - using type = L2SwimlaneAicpuOrchPhaseBuffer; - }; - - // Sched: actual scheduler-thread count is unknown at host-alloc time, so - // size buffers to the platform max. Orch: a single instance (pool 0), so - // allocate buffers for just one pool while still zeroing all MAX states. 
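The type-tag idiom used by `init_phase_pools` can be shown standalone. A generic lambda takes an empty tag object and recovers the concrete buffer type through `decltype(tag)::type`, so the `count` store always uses the matching layout. In this sketch `SchedBuffer`, `OrchBuffer`, and `zero_count` are hypothetical; only the shape of the tag structs follows the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical buffers with different record sizes; `count` sits at the tail,
// so its offset differs between the two layouts.
struct SchedBuffer {
    uint8_t records[4][64];
    uint32_t count;
};
struct OrchBuffer {
    uint8_t records[4][32];
    uint32_t count;
};

// Empty tag types that carry the buffer type as a nested alias.
struct SchedTag {
    using type = SchedBuffer;
};
struct OrchTag {
    using type = OrchBuffer;
};

// Generic lambda (C++14): the tag argument is unused at run time and exists
// only so the compiler can deduce which Buffer layout to apply.
const auto zero_count = [](auto tag, void *raw) -> std::size_t {
    using Buffer = typename decltype(tag)::type;
    assert(raw != nullptr);
    static_cast<Buffer *>(raw)->count = 0;  // store lands at the right offset
    return sizeof(Buffer);
};
```

Calling `zero_count(OrchTag{}, ptr)` on an orch allocation writes inside that allocation. Casting the same pointer to the larger sched layout would write past its end, which is the heap corruption the comment in the diff warns about.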
- if (init_phase_pools( - SchedTag{}, get_sched_phase_buffer_state, /*state_count=*/num_phase_threads, - /*buffer_count=*/num_phase_threads, /*buffers_per_thread=*/PLATFORM_PROF_SCHED_BUFFERS_PER_THREAD, - ProfBufferType::AICPU_SCHED_PHASE, "sched" - ) != 0) { - return -1; - } - auto orch_get_state = [](void *base, int n_cores, int t) { - return get_orch_phase_buffer_state(base, n_cores, t); - }; - if (init_phase_pools( - OrchTag{}, orch_get_state, /*state_count=*/num_phase_threads, /*buffer_count=*/1, - /*buffers_per_thread=*/PLATFORM_PROF_ORCH_BUFFERS_PER_THREAD, ProfBufferType::AICPU_ORCH_PHASE, "orch" - ) != 0) { - return -1; - } - LOG_DEBUG( - "Initialized %d sched (%d buf/thread) + 1 orch (%d buf) PhaseBufferStates (seeded up to %d free_queue " - "slots)", - num_phase_threads, PLATFORM_PROF_SCHED_BUFFERS_PER_THREAD, PLATFORM_PROF_ORCH_BUFFERS_PER_THREAD, - PLATFORM_PROF_SLOT_COUNT - ); - - wmb(); - - // Step 7: Stash device pointer for the caller to publish via - // kernel_args.l2_swimlane_data_base (read back via get_l2_swimlane_setup_device_ptr()). - LOG_DEBUG("L2 swimlane device base = 0x%lx", reinterpret_cast(perf_dev_ptr)); - - perf_shared_mem_dev_ = perf_dev_ptr; - // Refresh memory context with the now-known SHM tuple. start(tf) (inherited) - // gates on shm_host_, so this is the moment the collector becomes startable. 
- set_memory_context( - alloc_cb, register_cb, free_cb, /*copy_to=*/nullptr, /*copy_from=*/nullptr, perf_dev_ptr, perf_host_ptr, - total_size, device_id - ); - - collected_perf_records_.assign(num_aicore_, {}); - collected_aicore_records_.assign(num_aicore_, {}); - collected_sched_phase_records_.assign(PLATFORM_MAX_AICPU_THREADS, {}); - collected_orch_phase_records_.assign(PLATFORM_MAX_AICPU_THREADS, {}); - - LOG_INFO_V0("Performance profiling initialized (dynamic buffer mode)"); - return 0; -} - -// --------------------------------------------------------------------------- -// ProfilerBase callbacks -// --------------------------------------------------------------------------- - -void L2SwimlaneCollector::copy_perf_buffer(const ReadyBufferInfo &info) { - L2SwimlaneAicpuTaskBuffer *buf = reinterpret_cast(info.host_buffer_ptr); - rmb(); - uint32_t count = buf->count; - if (count > PLATFORM_PROF_BUFFER_SIZE) { - count = PLATFORM_PROF_BUFFER_SIZE; - } - uint32_t core_index = info.index; - if (core_index < static_cast(num_aicore_)) { - std::scoped_lock lock(perf_record_mutexes_[core_index]); - for (uint32_t i = 0; i < count; i++) { - collected_perf_records_[core_index].push_back(buf->records[i]); - } - total_perf_collected_.fetch_add(count, std::memory_order_relaxed); - } -} - -void L2SwimlaneCollector::copy_sched_phase_buffer(const ReadyBufferInfo &info) { - auto *buf = reinterpret_cast(info.host_buffer_ptr); - rmb(); - uint32_t count = buf->count; - if (count > static_cast(PLATFORM_PHASE_RECORDS_PER_THREAD)) { - count = PLATFORM_PHASE_RECORDS_PER_THREAD; - } - uint32_t tidx = info.index; - if (tidx < collected_sched_phase_records_.size()) { - std::scoped_lock lock(sched_phase_record_mutexes_[tidx]); - for (uint32_t i = 0; i < count; i++) { - collected_sched_phase_records_[tidx].push_back(buf->records[i]); - } - total_sched_phase_collected_.fetch_add(count, std::memory_order_relaxed); - if (count > 0) { - has_phase_data_.store(true, std::memory_order_relaxed); - } - 
} -} - -void L2SwimlaneCollector::copy_orch_phase_buffer(const ReadyBufferInfo &info) { - auto *buf = reinterpret_cast(info.host_buffer_ptr); - rmb(); - uint32_t count = buf->count; - if (count > static_cast(PLATFORM_PHASE_RECORDS_PER_THREAD)) { - count = PLATFORM_PHASE_RECORDS_PER_THREAD; - } - uint32_t tidx = info.index; - if (tidx < collected_orch_phase_records_.size()) { - std::scoped_lock lock(orch_phase_record_mutexes_[tidx]); - for (uint32_t i = 0; i < count; i++) { - collected_orch_phase_records_[tidx].push_back(buf->records[i]); - } - total_orch_phase_collected_.fetch_add(count, std::memory_order_relaxed); - if (count > 0) { - has_phase_data_.store(true, std::memory_order_relaxed); - } - } -} - -// AICore record buffers arrive on the ready queue in per-core rotation order -// (AICPU enqueues them at PLATFORM_AICORE_BUFFER_SIZE dispatch boundaries + -// once at flush). Within a single buffer, AICore wrote records[0..buf->count) -// in the order tasks ran on that core (completion-before-dispatch invariant -// + AICPU stamps buf->count just before enqueue). Flattening in arrival -// order gives us the per-core task stream that join_aicore_records() -// indexes by reg_task_id. -// -// Defensive filter: skip records whose `start_time == 0`. AICore writes -// `get_sys_cnt_aicore()` (a free-running cycle counter, always non-zero in -// practice) at task end, so a zero start_time means the slot was never -// written by AICore for this session. This handles two edge cases without -// special-casing them: -// - Recycled buffer where AICore wrote fewer records than the count stamp -// (e.g., the rare dispatch-boundary race for sub-microsecond kernels -// where AICore's next record_task fires before AICPU's rotation has -// propagated). The "missing" slot's previous contents are zero because -// allocate_single_buffer memsets at allocation. -// - Flush-path partial buffer whose tail wasn't reached. 
-void L2SwimlaneCollector::copy_aicore_buffer(const ReadyBufferInfo &info) { - L2SwimlaneAicoreTaskBuffer *buf = reinterpret_cast(info.host_buffer_ptr); - rmb(); - uint32_t core_index = info.index; - if (core_index >= static_cast(num_aicore_)) { - return; - } - uint32_t count = buf->count; - if (count > static_cast(PLATFORM_AICORE_BUFFER_SIZE)) { - count = PLATFORM_AICORE_BUFFER_SIZE; - } - uint32_t skipped = 0; - { - std::scoped_lock lock(aicore_record_mutexes_[core_index]); - auto &dst = collected_aicore_records_[core_index]; - dst.reserve(dst.size() + count); - for (uint32_t i = 0; i < count; i++) { - const L2SwimlaneAicoreTaskRecord &r = buf->records[i]; - if (r.start_time == 0) { - skipped++; - continue; - } - dst.push_back(r); - } - } - if (skipped > 0) { - LOG_WARN( - "Core %u: skipped %u AICore record slot(s) with start_time=0 (race-window write or " - "recycled-buffer tail). buf seq=%u count=%u", - core_index, skipped, info.buffer_seq, count - ); - } -} - -void L2SwimlaneCollector::on_buffer_collected(const ReadyBufferInfo &info) { - switch (info.type) { - case ProfBufferType::AICPU_TASK: - copy_perf_buffer(info); - break; - case ProfBufferType::AICPU_SCHED_PHASE: - copy_sched_phase_buffer(info); - break; - case ProfBufferType::AICPU_ORCH_PHASE: - copy_orch_phase_buffer(info); - break; - case ProfBufferType::AICORE_TASK: - copy_aicore_buffer(info); - break; - } -} - -// --------------------------------------------------------------------------- -// reconcile_counters / read_phase_header_metadata -// --------------------------------------------------------------------------- -// -// Host never recovers records from device-side current_buf_ptr. Device flush -// is the only data path: a flush failure must bump dropped_record_count and -// clear current_buf_ptr on the device side. Host's job here is purely -// accounting + sanity check. 
- -void L2SwimlaneCollector::reconcile_counters() { - if (shm_host_ == nullptr) { - return; - } - - rmb(); - - // Two-bucket invariant (post-AICore-as-producer): every commit attempt - // bumps total_record_count; capacity-driven drops (no free buffer / - // queue full / flush failure) bump dropped_record_count. - // silent_loss = device_total - (collected + dropped) - // and any non-zero silent loss flags an unaccounted gap on top of the - // already-classified dropped losses. - // - // Sanity sub-check: after stop(), any active buffer with records must - // have been flushed by AICPU (success → current_buf_ptr=0; failure → - // bump dropped, clear count + current_buf_ptr). A non-zero pointer with - // non-zero count means records AICPU neither delivered nor accounted - // for — i.e. a device-side flush bug. Empty buffers (count=0, never - // written) are fine; AICPU's flush legitimately skips them. - auto reconcile_one = [&](const char *kind, const char *unit_name, int unit_count, auto get_state, - auto read_buf_count, uint64_t collected, bool optional) { - int leftover_active = 0; - for (int i = 0; i < unit_count; i++) { - L2SwimlaneAicpuTaskPool *state = get_state(i); - uint64_t buf_ptr = state->head.current_buf_ptr; - if (buf_ptr == 0) continue; - void *host_ptr = manager_.resolve_host_ptr(reinterpret_cast(buf_ptr)); - if (host_ptr == nullptr) continue; - uint32_t count = read_buf_count(host_ptr); - if (count == 0) continue; - LOG_ERROR( - "L2Swimlane reconcile: %s %d has un-flushed %s buffer (current_buf_ptr=0x%lx, count=%u) " - "after stop() — device flush failed", - unit_name, i, kind, static_cast(buf_ptr), count - ); - leftover_active++; - } - - uint64_t total_device = 0; - uint64_t dropped_device = 0; - for (int i = 0; i < unit_count; i++) { - L2SwimlaneAicpuTaskPool *state = get_state(i); - total_device += state->head.total_record_count; - dropped_device += state->head.dropped_record_count; - } - - // PHASE counters are populated only by runtimes that 
actually emit - // phase records; skip the comparison entirely when nothing happened. - if (optional && total_device == 0 && collected == 0 && dropped_device == 0) { - return; - } - - if (dropped_device > 0) { - LOG_WARN( - "L2Swimlane reconcile: %lu %s records dropped on device side.", - static_cast(dropped_device), kind - ); - } - uint64_t accounted = collected + dropped_device; - if (accounted != total_device) { - LOG_WARN( - "L2Swimlane reconcile: %s count mismatch (collected=%lu + dropped=%lu != " - "device_total=%lu, silent_loss=%ld)", - kind, static_cast(collected), static_cast(dropped_device), - static_cast(total_device), static_cast(total_device) - static_cast(accounted) - ); - } else { - LOG_INFO_V0( - "L2Swimlane reconcile: %s counts match (collected=%lu, dropped=%lu, device_total=%lu)", kind, - static_cast(collected), static_cast(dropped_device), - static_cast(total_device) - ); - } - - if (leftover_active > 0) { - LOG_ERROR( - "L2Swimlane reconcile: %d %s(s) had un-cleared %s current_buf_ptr — see prior errors", leftover_active, - unit_name, kind - ); - } - }; - - reconcile_one( - "PERF", "core", num_aicore_, - [this](int core_index) { - return get_perf_buffer_state(shm_host_, core_index); - }, - [](void *host_ptr) { - return reinterpret_cast(host_ptr)->count; - }, - total_perf_collected_.load(std::memory_order_relaxed), /*optional=*/false - ); - - reconcile_one( - "SCHED_PHASE", "thread", PLATFORM_MAX_AICPU_THREADS, - [this](int thread_index) { - return get_sched_phase_buffer_state(shm_host_, num_aicore_, thread_index); - }, - [](void *host_ptr) { - return reinterpret_cast(host_ptr)->count; - }, - total_sched_phase_collected_.load(std::memory_order_relaxed), /*optional=*/true - ); - - reconcile_one( - "ORCH_PHASE", "thread", PLATFORM_MAX_AICPU_THREADS, - [this](int thread_index) { - return get_orch_phase_buffer_state(shm_host_, num_aicore_, thread_index); - }, - [](void *host_ptr) { - return reinterpret_cast(host_ptr)->count; - }, - 
total_orch_phase_collected_.load(std::memory_order_relaxed), /*optional=*/true - ); -} - -void L2SwimlaneCollector::read_phase_header_metadata() { - if (shm_host_ == nullptr) { - return; - } - - rmb(); - - L2SwimlaneDataHeader *header = get_l2_swimlane_header(shm_host_); - - int num_sched = static_cast(header->num_sched_phase_threads); - int num_orch = static_cast(header->num_orch_phase_threads); - if (num_sched == 0 && num_orch == 0) { - LOG_INFO_V0("No phase profiling data found (sched/orch phase thread counts both 0; phase init never ran)"); - return; - } - if (num_sched > PLATFORM_MAX_AICPU_THREADS || num_orch > PLATFORM_MAX_AICPU_THREADS) { - LOG_ERROR( - "Invalid phase thread counts from shared memory (sched=%d, orch=%d, max=%d)", num_sched, num_orch, - PLATFORM_MAX_AICPU_THREADS - ); - return; - } - // Scheduler threads occupy AICPU threads [0, num_sched); the dedicated - // orchestrator runs on the last AICPU thread (aicpu_thread_num_ - 1). The - // orch-phase pool is a single instance, so its pool index does not encode - // the AICPU thread — derive the thread number from aicpu_thread_num_. - // aicpu_thread_num_ is >= 1 (DeviceRunner::run validates launch_aicpu_num in - // [1, PLATFORM_MAX_AICPU_THREADS] before initialize()), so the subtraction - // can't go negative. This is a log-only display value, never an index. 
- const int orch_thread = aicpu_thread_num_ - 1; - LOG_INFO_V0( - "Collecting phase metadata: scheduler threads 0-%d, orchestrator thread %d", num_sched - 1, orch_thread - ); - - for (size_t t = 0; t < collected_sched_phase_records_.size(); t++) { - if (!collected_sched_phase_records_[t].empty()) { - LOG_INFO_V0(" Sched thread %zu: %zu records", t, collected_sched_phase_records_[t].size()); - } - } - for (size_t t = 0; t < collected_orch_phase_records_.size(); t++) { - if (!collected_orch_phase_records_[t].empty()) { - LOG_INFO_V0(" Orch thread %d: %zu records", orch_thread, collected_orch_phase_records_[t].size()); - } - } - - // has_phase_data_ is set by copy_sched_phase_buffer / copy_orch_phase_buffer - // during the drain — every push goes through those call sites and toggles - // the flag. No re-scan needed here. - - // Core-to-thread mapping (header-resident; not buffered). - int num_phase_cores = static_cast(header->num_phase_cores); - if (num_phase_cores > 0 && num_phase_cores <= PLATFORM_MAX_CORES) { - core_to_thread_.assign(header->core_to_thread, header->core_to_thread + num_phase_cores); - LOG_INFO_V0(" Core-to-thread mapping: %d cores", num_phase_cores); - } - - LOG_INFO_V0( - "Phase metadata collection complete: has_phase_data=%s", - has_phase_data_.load(std::memory_order_relaxed) ? "yes" : "no" - ); -} - -void L2SwimlaneCollector::set_core_types(const CoreType *types, int n) { - if (types == nullptr || n <= 0) { - core_types_.clear(); - return; - } - core_types_.assign(types, types + n); -} - -// JSON v2 emit: the host now dumps raw cycle-domain per-stream records plus -// metadata, and `swimlane_converter.py` performs the join (AICore↔AICPU on -// reg_task_id, base_time normalization, cycles→µs conversion, sort, core_type -// lookup, func_id resolution against deps.json). Moving the join into Python -// makes the schema easy to evolve without round-tripping through C++ + a -// rebuild, and shrinks this file to a pure dump. 
-int L2SwimlaneCollector::export_swimlane_json() { - if (shm_host_ == nullptr) { - return -1; - } - - // Empty-export guard: nothing useful on disk if every per-stream source is - // empty. AICPU_TIMING+ relies on `collected_perf_records_`; AICORE_TIMING - // (level=1) relies on `collected_aicore_records_` alone. - bool has_any_records = false; - for (const auto &core_records : collected_perf_records_) { - if (!core_records.empty()) { - has_any_records = true; - break; - } - } - if (!has_any_records) { - for (const auto &ac_records : collected_aicore_records_) { - if (!ac_records.empty()) { - has_any_records = true; - break; - } - } - } - if (!has_any_records) { - LOG_WARN("Warning: No performance data to export."); - return -1; - } - - std::error_code ec; - std::filesystem::create_directories(output_prefix_, ec); - if (ec) { - LOG_ERROR("Error: Failed to create output directory %s: %s", output_prefix_.c_str(), ec.message().c_str()); - return -1; - } - - std::string filepath = output_prefix_ + "/l2_swimlane_records.json"; - std::ofstream outfile(filepath); - if (!outfile.is_open()) { - LOG_ERROR("Error: Failed to open file: %s", filepath.c_str()); - return -1; - } - - int l2_swimlane_level = static_cast(l2_swimlane_level_); - - outfile << "{\n"; - outfile << " \"l2_swimlane_level\": " << l2_swimlane_level << ",\n"; - - // metadata: everything python needs that isn't in a per-record stream. - // clock_freq_hz drives the cycles→µs conversion (a2a3 = 50 MHz, a5 = - // 1 GHz — must come from the host, not be hardcoded in python). - outfile << " \"metadata\": {\n"; - outfile << " \"clock_freq_hz\": " << PLATFORM_PROF_SYS_CNT_FREQ << ",\n"; - outfile << " \"num_cores\": " << num_aicore_ << ",\n"; - outfile << " \"core_types\": ["; - for (int i = 0; i < num_aicore_; i++) { - CoreType ct = (i < static_cast(core_types_.size())) ? core_types_[i] : CoreType::AIV; - if (i > 0) outfile << ", "; - outfile << "\"" << ((ct == CoreType::AIC) ? 
"aic" : "aiv") << "\""; - } - outfile << "]"; - if (!core_to_thread_.empty()) { - outfile << ",\n \"core_to_thread\": ["; - for (size_t i = 0; i < core_to_thread_.size(); i++) { - if (i > 0) outfile << ", "; - outfile << static_cast(core_to_thread_[i]); - } - outfile << "]"; - } - outfile << "\n },\n"; - - // Per-stream raw records. Flat array of tuples — compact at scale (a real - // PA trace has ~100K records, and per-field JSON keys would dominate the - // file size). Column order is documented in the schema comment at the top - // of swimlane_converter.py's v2 reader. - // - // aicore_tasks: [core_id, task_token_raw, reg_task_id, start_cycles, end_cycles, receive_to_start_cycles] - // aicpu_tasks: [core_id, reg_task_id, dispatch_cycles, finish_cycles] - { - // copy_aicore_buffer already drops r.start_time == 0 slots when - // collecting from the device side, so no defensive filter here. - outfile << " \"aicore_tasks\": ["; - bool first = true; - size_t total = 0; - for (size_t core_idx = 0; core_idx < collected_aicore_records_.size(); core_idx++) { - for (const auto &r : collected_aicore_records_[core_idx]) { - if (!first) outfile << ","; - outfile << "\n [" << core_idx << ", " << r.task_token_raw << ", " << r.reg_task_id << ", " - << r.start_time << ", " << r.end_time << ", " << r.receive_to_start_cycles << "]"; - first = false; - total++; - } - } - if (!first) outfile << "\n "; - outfile << "]"; - LOG_INFO_V0(" aicore_tasks: %zu records", total); - } - { - outfile << ",\n \"aicpu_tasks\": ["; - bool first = true; - size_t total = 0; - for (size_t core_idx = 0; core_idx < collected_perf_records_.size(); core_idx++) { - for (const auto &r : collected_perf_records_[core_idx]) { - if (!first) outfile << ","; - outfile << "\n [" << core_idx << ", " << r.reg_task_id << ", " << r.dispatch_time << ", " - << r.finish_time << "]"; - first = false; - total++; - } - } - if (!first) outfile << "\n "; - outfile << "]"; - LOG_INFO_V0(" aicpu_tasks: %zu records", total); - } 
- - // Phase records keep their per-thread sub-array shape so the python - // consumer's existing iteration pattern (one thread per inner list) stays - // unchanged; only the field names move from *_us to *_cycles. - if (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) { - auto sched_phase_name = [](L2SwimlaneSchedPhaseKind kind) -> const char * { - switch (kind) { - case L2SwimlaneSchedPhaseKind::Complete: - return "complete"; - case L2SwimlaneSchedPhaseKind::Dispatch: - return "dispatch"; - case L2SwimlaneSchedPhaseKind::Release: - return "release"; - case L2SwimlaneSchedPhaseKind::Wire: - return "wire"; - case L2SwimlaneSchedPhaseKind::Dummy: - return "dummy"; - case L2SwimlaneSchedPhaseKind::EarlyDispatch: - return "early_dispatch"; - case L2SwimlaneSchedPhaseKind::Resolve: - return "resolve"; - case L2SwimlaneSchedPhaseKind::DummyTask: - return "dummy_task"; - } - return "unknown"; - }; - - auto emit_depth_array = [&outfile](const char *key, const int16_t arr[L2SWIMLANE_NUM_QUEUE_SHAPES]) { - outfile << ", \"" << key << "\": [" << arr[0] << "," << arr[1] << "," << arr[2] << "]"; - }; - outfile << ",\n \"aicpu_scheduler_phases\": [\n"; - for (size_t t = 0; t < collected_sched_phase_records_.size(); t++) { - outfile << " ["; - bool first = true; - for (const auto &pr : collected_sched_phase_records_[t]) { - if (!first) outfile << ","; - outfile << "\n {\"kind\": \"" << sched_phase_name(pr.kind) << "\"" - << ", \"start_cycles\": " << pr.start_time << ", \"end_cycles\": " << pr.end_time - << ", \"loop_iter\": " << pr.loop_iter << ", \"tasks_processed\": " << pr.tasks_processed; - if (pr.kind == L2SwimlaneSchedPhaseKind::Dispatch) { - outfile << ", \"pop_hit\": " << pr.pop_hit << ", \"pop_miss\": " << pr.pop_miss; - } - // Queue-depth snapshots — [AIC, AIV, MIX] per L2SwimlaneAicpuSchedPhaseRecord docstring. 
- emit_depth_array("shared_at_start", pr.shared_depth_at_start); - emit_depth_array("shared_at_end", pr.shared_depth_at_end); - outfile << "}"; - first = false; - } - if (!first) outfile << "\n "; - outfile << "]"; - if (t < collected_sched_phase_records_.size() - 1) outfile << ","; - outfile << "\n"; - } - outfile << " ]"; - - bool has_orch_phases = false; - if (l2_swimlane_level_ >= L2SwimlaneLevel::ORCH_PHASES) { - for (const auto &v : collected_orch_phase_records_) { - if (!v.empty()) { - has_orch_phases = true; - break; - } - } - } - if (has_orch_phases) { - size_t orch_lanes = static_cast(get_l2_swimlane_header(shm_host_)->num_orch_phase_threads); - if (orch_lanes == 0 || orch_lanes > collected_orch_phase_records_.size()) { - orch_lanes = collected_orch_phase_records_.size(); - } - outfile << ",\n \"aicpu_orchestrator_phases\": [\n"; - for (size_t t = 0; t < orch_lanes; t++) { - outfile << " ["; - bool first = true; - for (const auto &pr : collected_orch_phase_records_[t]) { - if (!first) outfile << ","; - outfile << "\n {\"submit_idx\": " << pr.submit_idx << ", \"task_id\": " << pr.task_id - << ", \"start_cycles\": " << pr.start_time << ", \"end_cycles\": " << pr.end_time << "}"; - first = false; - } - if (!first) outfile << "\n "; - outfile << "]"; - if (t < orch_lanes - 1) outfile << ","; - outfile << "\n"; - } - outfile << " ]"; - } - } - - outfile << "\n}\n"; - outfile.close(); - - if (!outfile) { - LOG_ERROR("Failed to write JSON file (stream error): %s", filepath.c_str()); - return -1; - } - - LOG_INFO_V0("=== JSON Export Complete ==="); - LOG_INFO_V0("File: %s", filepath.c_str()); - - return 0; -} - -int L2SwimlaneCollector::finalize(L2SwimlaneUnregisterCallback unregister_cb, const L2SwimlaneFreeCallback &free_cb) { - if (shm_host_ == nullptr) { - return 0; - } - - // Stop mgmt + collector threads if the caller didn't already (idempotent). 
- stop(); - - LOG_DEBUG("Cleaning up performance profiling resources"); - - // Every release site below goes through release_one_buffer so the - // unregister and free are an inseparable pair — each dev_ptr that - // alloc_single_buffer installed via halHostRegister is unregistered - // before its device memory is freed. Without this the Ascend HAL's - // per-device registration table accumulates leaked entries across - // init_l2_swimlane() invocations and back-to-back l2_swimlane tests on - // a reused Worker fail at rc=8 from halHostRegister. - - // Free standalone l2_swimlane_aicore_rotation_table table - release_one_buffer(aicore_ring_addr_table_dev_, unregister_cb, free_cb); - aicore_ring_addr_table_dev_ = nullptr; - - // Release framework-owned buffers (recycled pools, done_queue, ready_queue). - manager_.release_owned_buffers([this, unregister_cb, free_cb](void *p) { - release_one_buffer(p, unregister_cb, free_cb); - }); - - // Per-core: current buffer + free_queue slots — these were owned by - // the AICPU side, not the framework. Same drain pattern for both the - // L2SwimlaneAicpuTaskBuffer pool and the L2SwimlaneAicoreTaskBuffer pool. 
- auto drain_free_queue = [&](L2SwimlaneFreeQueue &fq) { - rmb(); - uint32_t head = fq.head; - uint32_t tail = fq.tail; - uint32_t queued = tail - head; - if (queued > PLATFORM_PROF_SLOT_COUNT) { - queued = PLATFORM_PROF_SLOT_COUNT; - } - for (uint32_t k = 0; k < queued; k++) { - uint32_t slot = (head + k) % PLATFORM_PROF_SLOT_COUNT; - release_one_buffer(reinterpret_cast(fq.buffer_ptrs[slot]), unregister_cb, free_cb); - fq.buffer_ptrs[slot] = 0; - } - fq.head = tail; - }; - - for (int i = 0; i < num_aicore_; i++) { - L2SwimlaneAicpuTaskPool *state = get_perf_buffer_state(shm_host_, i); - release_one_buffer(reinterpret_cast(state->head.current_buf_ptr), unregister_cb, free_cb); - state->head.current_buf_ptr = 0; - drain_free_queue(state->free_queue); - - L2SwimlaneAicoreTaskPool *ac_state = get_aicore_buffer_state(shm_host_, num_aicore_, i); - release_one_buffer(reinterpret_cast(ac_state->head.current_buf_ptr), unregister_cb, free_cb); - ac_state->head.current_buf_ptr = 0; - drain_free_queue(ac_state->free_queue); - } - - auto release_phase_pool = [&](L2SwimlaneAicpuTaskPool *state) { - release_one_buffer(reinterpret_cast(state->head.current_buf_ptr), unregister_cb, free_cb); - state->head.current_buf_ptr = 0; - - rmb(); - uint32_t head = state->free_queue.head; - uint32_t tail = state->free_queue.tail; - uint32_t queued = tail - head; - if (queued > PLATFORM_PROF_SLOT_COUNT) { - queued = PLATFORM_PROF_SLOT_COUNT; - } - for (uint32_t k = 0; k < queued; k++) { - uint32_t slot = (head + k) % PLATFORM_PROF_SLOT_COUNT; - release_one_buffer(reinterpret_cast(state->free_queue.buffer_ptrs[slot]), unregister_cb, free_cb); - state->free_queue.buffer_ptrs[slot] = 0; - } - state->free_queue.head = tail; - }; - int num_phase_threads = PLATFORM_MAX_AICPU_THREADS; - for (int t = 0; t < num_phase_threads; t++) { - release_phase_pool(get_sched_phase_buffer_state(shm_host_, num_aicore_, t)); - } - for (int t = 0; t < num_phase_threads; t++) { - 
release_phase_pool(get_orch_phase_buffer_state(shm_host_, num_aicore_, t)); - } - - // Main shm: unregister + free as a pair, same as every other buffer. - // ProfilerBase's set_memory_context handed register_cb == nullptr iff the - // caller doesn't intend to register, so checking unregister_cb inside - // release_one_buffer is sufficient — no separate ``was_registered_`` flag. - release_one_buffer(perf_shared_mem_dev_, unregister_cb, free_cb); - LOG_DEBUG("Main shm released"); - - perf_shared_mem_dev_ = nullptr; - // shm_host_ aliases freed device/host memory now; null it so is_initialized() - // reports false, the dtor's "destroyed without finalize()" warning stays - // quiet, and a re-entrant finalize() / re-init hits the early-out instead of - // walking freed buffer state. Mirrors PMU/DepGen/TensorDump collectors. - shm_host_ = nullptr; - collected_perf_records_.clear(); - collected_sched_phase_records_.clear(); - collected_orch_phase_records_.clear(); - core_to_thread_.clear(); - has_phase_data_.store(false, std::memory_order_relaxed); - total_perf_collected_.store(0, std::memory_order_relaxed); - total_sched_phase_collected_.store(0, std::memory_order_relaxed); - total_orch_phase_collected_.store(0, std::memory_order_relaxed); - clear_memory_context(); - - LOG_DEBUG("Performance profiling cleanup complete"); - return 0; -} diff --git a/src/a2a3/platform/sim/host/CMakeLists.txt b/src/a2a3/platform/sim/host/CMakeLists.txt index 3b9f283f0..40813f369 100644 --- a/src/a2a3/platform/sim/host/CMakeLists.txt +++ b/src/a2a3/platform/sim/host/CMakeLists.txt @@ -45,10 +45,10 @@ list(APPEND HOST_RUNTIME_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/profiling_copy.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/pto_runtime_c_api.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/platform_compile_info.cpp" - "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/l2_swimlane_collector.cpp" + 
"${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/l2_swimlane_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/tensor_dump_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/pmu_collector.cpp" - "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/dep_gen_collector.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/dep_gen_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/scope_stats_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/sim/aicpu/platform_aicpu_affinity.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform_comm/comm_sim.cpp" diff --git a/src/a5/platform/onboard/host/CMakeLists.txt b/src/a5/platform/onboard/host/CMakeLists.txt index 7450a1eb9..177abdecf 100644 --- a/src/a5/platform/onboard/host/CMakeLists.txt +++ b/src/a5/platform/onboard/host/CMakeLists.txt @@ -43,9 +43,9 @@ list(APPEND HOST_RUNTIME_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/pto_runtime_c_api.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/platform_compile_info.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/host_regs.cpp" - "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/l2_swimlane_collector.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/l2_swimlane_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/pmu_collector.cpp" - "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/dep_gen_collector.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/dep_gen_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/scope_stats_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/tensor_dump_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/comm_hccl.cpp" diff --git a/src/a5/platform/shared/aicpu/dep_gen_collector_aicpu.cpp b/src/a5/platform/shared/aicpu/dep_gen_collector_aicpu.cpp deleted file mode 100644 index e462c7579..000000000 --- 
a/src/a5/platform/shared/aicpu/dep_gen_collector_aicpu.cpp +++ /dev/null @@ -1,432 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * @file dep_gen_collector_aicpu.cpp - * @brief AICPU-side dep_gen capture implementation - * - * Single-instance: dep_gen captures the orchestrator's submit_task stream, - * so there is one BufferState and one current_buf — no per-core arrays. - * - * Buffer switching (SPSC): - * - Host pushes free DepGenBuffers via free_queue. - * - AICPU pops when current buffer fills; pushes full buffer to per-thread - * ready_queue (indexed by orch_thread_idx). - * - Full buffers are published before AICPU tries to recover a replacement. - * If recovery is delayed, later records are counted as dropped until host - * replenishes free_queue. Host reads dropped at finalize to decide whether - * to emit deps.json. - */ - -#include "aicpu/dep_gen_collector_aicpu.h" - -#include - -#include "aicpu/device_time.h" -#include "common/memory_barrier.h" -#include "common/platform_config.h" -#include "common/unified_log.h" - -static uint64_t g_platform_dep_gen_base = 0; -static bool g_enable_dep_gen = false; - -// File-local cached state for the single dep_gen instance (the orchestrator). 
-static DepGenDataHeader *s_dep_gen_header = nullptr; -static DepGenBufferState *s_dep_gen_state = nullptr; -static int s_orch_thread_idx = -1; // set via dep_gen_aicpu_set_orch_thread_idx - -static constexpr uint64_t kDepGenQueueBackpressureWaitCycles = PLATFORM_PROF_SYS_CNT_FREQ / 50000; // 20 us - -extern "C" void set_platform_dep_gen_base(uint64_t dep_gen_data_base) { g_platform_dep_gen_base = dep_gen_data_base; } - -extern "C" uint64_t get_platform_dep_gen_base() { return g_platform_dep_gen_base; } - -extern "C" void set_dep_gen_enabled(bool enable) { g_enable_dep_gen = enable; } - -extern "C" bool is_dep_gen_enabled() { return g_enable_dep_gen; } - -void dep_gen_aicpu_set_orch_thread_idx(int thread_idx) { s_orch_thread_idx = thread_idx; } - -// --------------------------------------------------------------------------- -// Internal: enqueue full buffer to per-thread ready_queue -// --------------------------------------------------------------------------- - -static bool -wait_for_ready_queue_space(DepGenDataHeader *header, int thread_idx, uint32_t *tail_out, uint32_t *head_out) { - if (header == nullptr || thread_idx < 0 || thread_idx >= PLATFORM_MAX_AICPU_THREADS) { - return false; - } - const uint32_t capacity = PLATFORM_DEP_GEN_READYQUEUE_SIZE; - const uint64_t start = get_sys_cnt_aicpu(); - - do { - uint32_t current_tail = header->queue_tails[thread_idx]; - uint32_t current_head = header->queue_heads[thread_idx]; - uint32_t next_tail = (current_tail + 1) % capacity; - if (next_tail != current_head) { - *tail_out = current_tail; - *head_out = current_head; - return true; - } - if (get_sys_cnt_aicpu() - start >= kDepGenQueueBackpressureWaitCycles) { - break; - } - } while (true); - return false; -} - -static bool wait_for_free_queue_entry(DepGenFreeQueue *free_queue, uint32_t *head_out, uint32_t *tail_out) { - if (free_queue == nullptr) { - return false; - } - const uint64_t start = get_sys_cnt_aicpu(); - - do { - uint32_t head = free_queue->head; - 
uint32_t tail = free_queue->tail; - if (head != tail) { - *head_out = head; - *tail_out = tail; - rmb(); // acquire: order the tail read above before the caller's buffer_ptrs read - return true; - } - if (get_sys_cnt_aicpu() - start >= kDepGenQueueBackpressureWaitCycles) { - break; - } - } while (true); - return false; -} - -static int enqueue_dep_gen_ready_buffer(uint64_t buffer_ptr, uint32_t buffer_seq) { - int q = s_orch_thread_idx; - uint32_t capacity = PLATFORM_DEP_GEN_READYQUEUE_SIZE; - uint32_t current_tail = 0; - uint32_t current_head = 0; - if (!wait_for_ready_queue_space(s_dep_gen_header, q, &current_tail, &current_head)) { - return -1; - } - - uint32_t next_tail = (current_tail + 1) % capacity; - s_dep_gen_header->queues[q][current_tail].instance_index = 0; - s_dep_gen_header->queues[q][current_tail].buffer_ptr = buffer_ptr; - s_dep_gen_header->queues[q][current_tail].buffer_seq = buffer_seq; - wmb(); // publish: entry fields visible before the tail advance - s_dep_gen_header->queue_tails[q] = next_tail; - return 0; -} - -static DepGenBuffer *try_pop_dep_gen_buffer(uint32_t next_seq) { - if (s_dep_gen_state == nullptr) { - return nullptr; - } - uint32_t head = 0; - uint32_t tail = 0; - if (!wait_for_free_queue_entry(&s_dep_gen_state->free_queue, &head, &tail)) { - return nullptr; - } - - uint64_t new_buf_ptr = s_dep_gen_state->free_queue.buffer_ptrs[head % PLATFORM_DEP_GEN_SLOT_COUNT]; - s_dep_gen_state->free_queue.head = head + 1; - if (new_buf_ptr == 0) { - return nullptr; - } - - DepGenBuffer *new_buf = reinterpret_cast<DepGenBuffer *>(new_buf_ptr); - new_buf->count = 0; - s_dep_gen_state->current_buf_ptr = new_buf_ptr; - s_dep_gen_state->current_buf_seq = next_seq; - wmb(); - return new_buf; -} - -// --------------------------------------------------------------------------- -// Internal: switch the current buffer -// --------------------------------------------------------------------------- - -static void dep_gen_switch_buffer() { - if (s_dep_gen_state == nullptr) { - return;
- } - DepGenBuffer *full_buf = reinterpret_cast<DepGenBuffer *>(s_dep_gen_state->current_buf_ptr); - if (full_buf == nullptr) { - return; - } - - uint32_t seq = s_dep_gen_state->current_buf_seq; - int rc = enqueue_dep_gen_ready_buffer(s_dep_gen_state->current_buf_ptr, seq); - if (rc != 0) { - LOG_ERROR("dep_gen: failed to enqueue full buffer (ready_queue full), %u records dropped", full_buf->count); - s_dep_gen_state->dropped_record_count += full_buf->count; - full_buf->count = 0; - wmb(); - return; - } - - uint32_t next_seq = seq + 1; - s_dep_gen_state->current_buf_ptr = 0; - s_dep_gen_state->current_buf_seq = next_seq; - wmb(); - - (void)try_pop_dep_gen_buffer(next_seq); -} - -// --------------------------------------------------------------------------- -// Public interface -// --------------------------------------------------------------------------- - -void dep_gen_aicpu_init() { - void *base = reinterpret_cast<void *>(get_platform_dep_gen_base()); - if (base == nullptr) { - LOG_ERROR("dep_gen_aicpu_init: dep_gen_data_base is NULL"); - return; - } - s_dep_gen_header = get_dep_gen_header(base); - s_dep_gen_state = get_dep_gen_buffer_state(base, /*instance_index=*/0); - - rmb(); - uint32_t head = s_dep_gen_state->free_queue.head; - uint32_t tail = s_dep_gen_state->free_queue.tail; - - if (head != tail) { - (void)try_pop_dep_gen_buffer(0); - uint64_t buf_ptr = s_dep_gen_state->current_buf_ptr; - LOG_INFO_V0("dep_gen: popped initial buffer addr=0x%lx", buf_ptr); - } else { - LOG_ERROR("dep_gen: free_queue empty during init"); - s_dep_gen_state->current_buf_ptr = 0; - } - wmb(); -} - -void dep_gen_aicpu_record_submit( - uint64_t task_id_raw, bool in_manual_scope, int tensor_count, const void *const *tensor_ptrs, - const uint8_t *arg_types, int explicit_dep_count, const uint64_t *explicit_deps_raw, int block_num, - const int32_t kernel_ids[3] -) { - if (!g_enable_dep_gen || s_dep_gen_state == nullptr) { - return; - } - - // Account every attempted record so total == collected + dropped on
host. - s_dep_gen_state->total_record_count += 1; - - int dc = explicit_dep_count; - if (dc < 0) dc = 0; - if (dc > 0 && explicit_deps_raw == nullptr) dc = 0; - int needed = dep_gen_records_needed_for(dc); - - rmb(); - uint64_t cur_ptr = s_dep_gen_state->current_buf_ptr; - if (cur_ptr == 0) { - DepGenBuffer *recovered = try_pop_dep_gen_buffer(s_dep_gen_state->current_buf_seq); - if (recovered == nullptr) { - s_dep_gen_state->dropped_record_count += 1; - wmb(); - return; - } - cur_ptr = s_dep_gen_state->current_buf_ptr; - } - DepGenBuffer *buf = reinterpret_cast(cur_ptr); - - // Snapshot the count from volatile shared memory into a local so capacity - // math, base-record idx, and the final publish all use the same value. - // Single-writer ownership means a re-read would return the same value - // today, but a local snapshot makes the invariant explicit and is also - // a guardrail if a future device-side actor ever races count. - uint32_t local_count = buf->count; - - // Reserve the whole chain up front. If it won't fit in the current - // buffer, switch first (skipping the switch when the current buffer is - // already empty — switching would just enqueue a zero-record buffer and - // pop a fresh one we'd truncate into anyway). Then, regardless of whether - // we switched, if the chain still won't fit (chain larger than the - // buffer), cap dc to what the buffer can hold and log truncation. 
- if (local_count > 0 && - local_count + static_cast(needed) > static_cast(PLATFORM_DEP_GEN_RECORDS_PER_BUFFER)) { - dep_gen_switch_buffer(); - rmb(); - cur_ptr = s_dep_gen_state->current_buf_ptr; - if (cur_ptr == 0) { - DepGenBuffer *recovered = try_pop_dep_gen_buffer(s_dep_gen_state->current_buf_seq); - if (recovered == nullptr) { - s_dep_gen_state->dropped_record_count += 1; - wmb(); - return; - } - cur_ptr = s_dep_gen_state->current_buf_ptr; - } - buf = reinterpret_cast(cur_ptr); - local_count = buf->count; // refresh after switch — new buffer starts at 0 - } - - const int capacity = PLATFORM_DEP_GEN_RECORDS_PER_BUFFER - static_cast(local_count); - if (capacity <= 0) { - // local_count is bounded by the previous writer's publish step, so - // this is only reachable if shared memory was corrupted out from - // under us. Drop the record and bail rather than write past the end - // of buf->records[]. - LOG_ERROR("dep_gen: invalid capacity %d (local_count=%u), dropping record", capacity, local_count); - s_dep_gen_state->dropped_record_count += 1; - wmb(); - return; - } - if (needed > capacity) { - // Compute the largest dc that fits in `capacity` slots. - int dc_fit = DEP_GEN_MAX_EXPLICIT_DEPS + (capacity - 1) * DEP_GEN_OVERFLOW_DEPS_PER_RECORD; - LOG_ERROR( - "dep_gen: chain (%d records for %d deps) exceeds buffer capacity (%d slots), truncating to %d deps", needed, - dc, capacity, dc_fit - ); - dc = dc_fit; - needed = dep_gen_records_needed_for(dc); - } - - int tc = tensor_count; - if (tc < 0) { - tc = 0; - } else if (tc > CORE_MAX_TENSOR_ARGS) { - // The runtime's Arg also caps at CORE_MAX_TENSOR_ARGS, so this should - // never trip; clamp defensively to keep the writer crash-free. 
- LOG_ERROR("dep_gen: tensor_count %d > CORE_MAX_TENSOR_ARGS (%d), truncating", tc, CORE_MAX_TENSOR_ARGS); - tc = CORE_MAX_TENSOR_ARGS; - } - - // ---- Write base record ---- - uint32_t idx = local_count; - DepGenRecord *rec = &buf->records[idx]; - - rec->task_id = task_id_raw; - // Cast the enum to uint32_t before the ternary so Linux GCC's -Wextra - // does not warn about "enumerated and non-enumerated type in conditional". - uint32_t base_flags = in_manual_scope ? static_cast(DEP_GEN_FLAG_IN_MANUAL_SCOPE) : 0u; - if (needed > 1) { - base_flags |= static_cast(DEP_GEN_FLAG_HAS_OVERFLOW); - } - rec->flags = base_flags; - rec->tensor_count = static_cast(tc); - rec->block_num = block_num > 0 ? static_cast(block_num) : 1u; - - int base_dc = (dc < DEP_GEN_MAX_EXPLICIT_DEPS) ? dc : DEP_GEN_MAX_EXPLICIT_DEPS; - rec->explicit_dep_count = static_cast(base_dc); - - // explicit_deps (tail of the entry, packed; replay reads only the first base_dc entries) - if (base_dc > 0) { - memcpy(rec->explicit_deps, explicit_deps_raw, static_cast(base_dc) * sizeof(uint64_t)); - } - - // arg_types - if (tc > 0 && arg_types != nullptr) { - memcpy(rec->arg_types, arg_types, static_cast(tc)); - } - - // Per-subslot kernel ids (AIC, AIV0, AIV1). The orchestrator owns the - // identity-side of the swimlane join: with task_id (PTO2 raw) + kernel_id - // captured here, the host post-processor can name every AICore record. - // Inactive subslots stay at INVALID_KERNEL_ID (-1); the caller is expected - // to pass that sentinel rather than 0. 
- if (kernel_ids != nullptr) { - rec->kernel_id[0] = kernel_ids[0]; - rec->kernel_id[1] = kernel_ids[1]; - rec->kernel_id[2] = kernel_ids[2]; - } else { - rec->kernel_id[0] = -1; - rec->kernel_id[1] = -1; - rec->kernel_id[2] = -1; - } - - // tensors[]: per-slot 128-byte blob (or zero if pointer is null — OUTPUT slot) - if (tc > 0) { - if (tensor_ptrs == nullptr) { - memset(rec->tensors, 0, static_cast(tc) * DEP_GEN_TENSOR_SIZE); - } else { - for (int i = 0; i < tc; i++) { - if (tensor_ptrs[i] == nullptr) { - memset(rec->tensors[i], 0, DEP_GEN_TENSOR_SIZE); - } else { - memcpy(rec->tensors[i], tensor_ptrs[i], DEP_GEN_TENSOR_SIZE); - } - } - } - } - - // ---- Write overflow chain ---- - // Charge each overflow slot to total_overflow_record_count so the host's - // reconciliation equation (`collected + dropped == total + total_overflow`) - // accounts for chain expansion. total_record_count stays "one per submit" - // — see DepGenBufferState doc. - if (needed > 1) { - s_dep_gen_state->total_overflow_record_count += static_cast(needed - 1); - } - int written = base_dc; - for (int slot = 1; slot < needed; slot++) { - auto *over = reinterpret_cast(&buf->records[idx + static_cast(slot)]); - over->task_id = task_id_raw; - const int chunk = - ((dc - written) < DEP_GEN_OVERFLOW_DEPS_PER_RECORD) ? (dc - written) : DEP_GEN_OVERFLOW_DEPS_PER_RECORD; - const bool is_last = (slot == needed - 1); - uint32_t over_flags = static_cast(DEP_GEN_FLAG_OVERFLOW); - if (is_last) { - over_flags |= static_cast(DEP_GEN_FLAG_LAST_OVERFLOW); - } - over->flags = over_flags; - over->dep_count = static_cast(chunk); - over->_reserved = 0; - if (chunk > 0) { - memcpy(over->deps, explicit_deps_raw + written, static_cast(chunk) * sizeof(uint64_t)); - } - written += chunk; - } - - // Publish all reserved slots atomically — host either sees the old count - // (chain invisible) or the new count with the full chain committed. 
The - // single trailing wmb() flushes both the record payloads and the count - // store, matching the pre-chain contract. - buf->count = idx + static_cast(needed); - wmb(); -} - -void dep_gen_aicpu_flush() { - if (s_dep_gen_header == nullptr || s_dep_gen_state == nullptr) { - return; - } - - rmb(); - uint64_t buf_ptr = s_dep_gen_state->current_buf_ptr; - if (buf_ptr == 0) { - return; - } - DepGenBuffer *buf = reinterpret_cast(buf_ptr); - if (buf->count == 0) { - return; - } - - uint32_t seq = s_dep_gen_state->current_buf_seq; - int rc = enqueue_dep_gen_ready_buffer(buf_ptr, seq); - if (rc == 0) { - LOG_INFO_V0("dep_gen: flushed buffer with %u records", buf->count); - s_dep_gen_state->current_buf_ptr = 0; - wmb(); - } else { - LOG_ERROR("dep_gen: flush failed (ready_queue full), %u records dropped", buf->count); - s_dep_gen_state->dropped_record_count += buf->count; - buf->count = 0; - s_dep_gen_state->current_buf_ptr = 0; - wmb(); - } -} - -void dep_gen_aicpu_finalize() { - // No HW state to restore (unlike PMU). Reset file-local cache for cleanliness - // — the next init re-resolves these from the (potentially new) base anyway. - s_dep_gen_header = nullptr; - s_dep_gen_state = nullptr; - s_orch_thread_idx = -1; -} diff --git a/src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp b/src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp deleted file mode 100644 index 260374832..000000000 --- a/src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp +++ /dev/null @@ -1,1016 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. 
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * @file l2_swimlane_collector_aicpu.cpp - * @brief AICPU performance data collection implementation (SPSC free queue) - * - * Uses per-core L2SwimlaneAicpuTaskPool with SPSC free queues for O(1) buffer switching. - * Host memory manager dynamically allocates replacement buffers and pushes - * them into the free_queue. Device pops from free_queue when switching. - */ - -#include "aicpu/l2_swimlane_collector_aicpu.h" - -#include -#include - -#include "aicpu/platform_regs.h" -#include "common/memory_barrier.h" -#include "common/platform_config.h" -#include "common/unified_log.h" - -// Cached pointers for hot-path access (set during init). Phase metadata -// (num_sched_phase_threads, num_orch_phase_threads, num_phase_cores, -// core_to_thread[]) lives inside L2SwimlaneDataHeader after the phase-header -// merge; we keep a separate bool so phase-gated paths can check init-ran -// without re-reading the device-shared header. -static L2SwimlaneDataHeader *s_l2_swimlane_header = nullptr; -static bool s_phase_initialized = false; - -// Per-core L2SwimlaneAicpuTaskPool cache -static L2SwimlaneAicpuTaskPool *s_aicpu_task_pools[PLATFORM_MAX_CORES] = {}; - -// Per-core L2SwimlaneAicoreTaskPool cache (lives in the same shared region; -// host writes initial pool + the rotation channel that AICore polls). -// -// All AICore-side bookkeeping (rotation channel, free queue, -// total_record_count, current_buf_seq) is owned by this shared struct — see -// l2_swimlane_profiling.h. 
We deliberately do not keep AICPU-process-local -// mirror counters because the struct's volatile fields are the single -// source of truth across init/complete/rotate/flush. The high-water-mark -// formula `total_record_count - current_buf_seq * BUFFER_SIZE` correctly -// handles the failed-rotation case (free_queue empty or ready_queue full) -// since current_buf_seq only bumps on a successful rotation. -static L2SwimlaneAicoreTaskPool *s_aicore_task_pools[PLATFORM_MAX_CORES] = {}; - -// Per-core AICPU-side dispatch count. Incremented on every -// `l2_swimlane_aicpu_on_aicore_dispatch` call (= once per AICore dispatch). -// When the pre-bump value is a non-zero multiple of PLATFORM_AICORE_BUFFER_SIZE, -// AICPU rotates the AICore buffer before the upcoming write_reg(DATA_MAIN_BASE). -// Single-writer per cell (the scheduler thread that owns the core). -static uint32_t s_aicore_dispatched_count[PLATFORM_MAX_CORES] = {}; - -// Per-core cached current-records-buffer pointer. Written by AICPU when -// rotating buffers from inside `complete_record`. AICore writes to its own -// per-core L2SwimlaneAicoreTaskBuffer (host-allocated, AICPU rotates) and AICPU -// never reads from it on the hot path. -static L2SwimlaneAicpuTaskBuffer *s_current_aicpu_task_buffers[PLATFORM_MAX_CORES] = {}; - -// Per-thread sched-phase pool/buffer caches (per-scheduler-thread) -static L2SwimlaneAicpuSchedPhasePool *s_sched_phase_pools[PLATFORM_MAX_AICPU_THREADS] = {}; -static L2SwimlaneAicpuSchedPhaseBuffer *s_current_sched_phase_buffers[PLATFORM_MAX_AICPU_THREADS] = {}; - -// Per-thread orch-phase pool/buffer caches (one orch thread). -static L2SwimlaneAicpuOrchPhasePool *s_orch_phase_pools[PLATFORM_MAX_AICPU_THREADS] = {}; -static L2SwimlaneAicpuOrchPhaseBuffer *s_current_orch_phase_buffers[PLATFORM_MAX_AICPU_THREADS] = {}; - -static int s_orch_thread_idx = -1; - -// L2 swimlane platform state. 
Published by the host (via dlsym'd setters on sim) -// or by the AICPU kernel entry (onboard) before perf init runs, so downstream -// perf code can discover enablement + device-base without reading the generic -// Runtime struct. Two channels (mirrors PMU): -// - g_enable_l2_swimlane (bool) — set at kernel entry from the bitmask bit -// - g_l2_swimlane_level (L2SwimlaneLevel) — promoted in -// l2_swimlane_aicpu_init from the shared-memory header so -// `>= AICPU_TIMING / SCHED_PHASES / ORCH_PHASES` gates have the granular -// value (exposed via get_l2_swimlane_level()). -static uint64_t g_platform_l2_swimlane_base = 0; -static bool g_enable_l2_swimlane = false; -static L2SwimlaneLevel g_l2_swimlane_level = L2SwimlaneLevel::DISABLED; - -// AICore rotation-table device pointer (= KernelArgs::l2_swimlane_aicore_rotation_table). -// Published by the host (sim: dlsym'd setter; onboard: from k_args via the -// kernel entry); AICPU init walks it to fill per-core &rotation addresses. -static uint64_t g_platform_l2_swimlane_aicore_rotation_table = 0; - -extern "C" void set_platform_l2_swimlane_base(uint64_t l2_swimlane_data_base) { - g_platform_l2_swimlane_base = l2_swimlane_data_base; -} -extern "C" uint64_t get_platform_l2_swimlane_base() { return g_platform_l2_swimlane_base; } -extern "C" void set_l2_swimlane_enabled(bool enable) { g_enable_l2_swimlane = enable; } -extern "C" bool is_l2_swimlane_enabled() { return g_enable_l2_swimlane; } -extern "C" void set_platform_l2_swimlane_aicore_rotation_table(uint64_t table_addr) { - g_platform_l2_swimlane_aicore_rotation_table = table_addr; -} -extern "C" uint64_t get_platform_l2_swimlane_aicore_rotation_table() { - return g_platform_l2_swimlane_aicore_rotation_table; -} -L2SwimlaneLevel get_l2_swimlane_level() { return g_l2_swimlane_level; } - -static constexpr uint64_t kL2SwimlaneQueueBackpressureWaitCycles = PLATFORM_PROF_SYS_CNT_FREQ / 50000; // 20 us - -static bool -wait_for_ready_queue_space(L2SwimlaneDataHeader *header, 
int thread_idx, uint32_t *tail_out, uint32_t *head_out) { - if (header == nullptr || thread_idx < 0 || thread_idx >= PLATFORM_MAX_AICPU_THREADS) { - return false; - } - const uint32_t capacity = PLATFORM_PROF_READYQUEUE_SIZE; - const uint64_t start = get_sys_cnt_aicpu(); - - do { - uint32_t current_tail = header->queue_tails[thread_idx]; - uint32_t current_head = header->queue_heads[thread_idx]; - uint32_t next_tail = (current_tail + 1) % capacity; - if (next_tail != current_head) { - *tail_out = current_tail; - *head_out = current_head; - return true; - } - if (get_sys_cnt_aicpu() - start >= kL2SwimlaneQueueBackpressureWaitCycles) { - break; - } - } while (true); - return false; -} - -static bool wait_for_free_queue_entry(L2SwimlaneFreeQueue *free_queue, uint32_t *head_out, uint32_t *tail_out) { - if (free_queue == nullptr) { - return false; - } - const uint64_t start = get_sys_cnt_aicpu(); - - do { - uint32_t head = free_queue->head; - uint32_t tail = free_queue->tail; - if (head != tail) { - *head_out = head; - *tail_out = tail; - rmb(); // acquire: order the tail read above before the caller's buffer_ptrs read - return true; - } - if (get_sys_cnt_aicpu() - start >= kL2SwimlaneQueueBackpressureWaitCycles) { - break; - } - } while (true); - return false; -} - -/** - * Enqueue ready buffer to per-thread queue - * - * @param header L2SwimlaneDataHeader pointer - * @param thread_idx AICPU thread index (selects the per-thread ready queue) - * @param core_index Core index for task entries, or pool ordinal for phase entries - * @param buffer_ptr Device pointer to the full buffer - * @param buffer_seq Sequence number for ordering - * @param kind Buffer kind discriminator (see L2SwimlaneBufferKind) - * @return 0 on success, -1 if queue full - */ -static int enqueue_ready_buffer( - L2SwimlaneDataHeader *header, int thread_idx, uint32_t core_index, uint64_t buffer_ptr, uint32_t buffer_seq, - L2SwimlaneBufferKind kind -) { - uint32_t capacity = 
PLATFORM_PROF_READYQUEUE_SIZE; - uint32_t current_tail = 0; - uint32_t current_head = 0; - - if (!wait_for_ready_queue_space(header, thread_idx, &current_tail, &current_head)) { - return -1; - } - uint32_t next_tail = (current_tail + 1) % capacity; - - header->queues[thread_idx][current_tail].core_index = core_index; - header->queues[thread_idx][current_tail].kind = kind; - header->queues[thread_idx][current_tail].buffer_ptr = buffer_ptr; - header->queues[thread_idx][current_tail].buffer_seq = buffer_seq; - wmb(); // publish: entry fields visible before the tail advance - header->queue_tails[thread_idx] = next_tail; - - return 0; -} - -static L2SwimlaneAicpuTaskBuffer * -try_pop_records_buffer(int core_id, L2SwimlaneAicpuTaskPool *state, uint32_t next_seq) { - uint32_t head = 0; - uint32_t tail = 0; - if (!wait_for_free_queue_entry(&state->free_queue, &head, &tail)) { - return nullptr; - } - - uint64_t new_buf_ptr = state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT]; - rmb(); - state->free_queue.head = head + 1; - if (new_buf_ptr == 0) { - return nullptr; - } - - auto *new_buf = reinterpret_cast<L2SwimlaneAicpuTaskBuffer *>(new_buf_ptr); - new_buf->count = 0; - wmb(); - - state->head.current_buf_ptr = new_buf_ptr; - state->head.current_buf_seq = next_seq; - s_current_aicpu_task_buffers[core_id] = new_buf; - wmb(); - return new_buf; -} - -void l2_swimlane_aicpu_init(int worker_count) { - // Reset cross-launch state up front. AICPU statics persist across launches - // on the same loaded .so; without this reset, an enabled→disabled launch - // sequence would leave s_phase_initialized=true from the prior run, and - // any subsequent record_sched_phase / record_orch_phase call would - // dereference the prior launch's (now-freed) s_sched_phase_pools / - // s_orch_phase_pools pointers. Same shape as the [[block_local]] reset - // in onboard/aicore/kernel.cpp for the AICore-side rotation slot - // (fixed in #936).
- s_phase_initialized = false; - - // Reset AICore dispatch-count bookkeeping for the same reason: the next - // launch must start counting from 0 so the rotation boundary check - // (count % BUFFER_SIZE == 0) lands on the right dispatches. Stale values - // from a prior launch would skip the first rotation (count already past a - // boundary) or trigger one prematurely. - for (int i = 0; i < PLATFORM_MAX_CORES; i++) { - s_aicore_dispatched_count[i] = 0; - } - - void *l2_swimlane_base = reinterpret_cast(g_platform_l2_swimlane_base); - if (l2_swimlane_base == nullptr) { - LOG_ERROR("l2_swimlane_data_base is NULL, cannot initialize profiling"); - return; - } - - s_l2_swimlane_header = get_l2_swimlane_header(l2_swimlane_base); - - // Read the granular perf_level from the shared-memory header (host wrote - // it in L2SwimlaneCollector::initialize). The kernel-entry setter only seeded - // the binary g_enable_l2_swimlane via the bitmask bit. - g_l2_swimlane_level = static_cast(s_l2_swimlane_header->l2_swimlane_level); - - LOG_INFO_V0( - "Initializing performance profiling for %d cores (free queue), l2_swimlane_level=%u", worker_count, - static_cast(g_l2_swimlane_level) - ); - - // Populate the per-core AICore head device-address table. AICore reads - // `l2_swimlane_aicore_rotation_table[block_idx]` from KernelArgs to find - // its `L2SwimlaneActiveHead` cache line; the table itself is - // host-allocated, but the entries are device-internal addresses - // (`&ac_state->head`) that the host would otherwise have to translate - // from host-mapped to device-mapped. AICPU already runs on the device, - // so it can write the addresses directly without any translation — that - // keeps the host side decoupled from the AICore shared-memory layout. 
- uint64_t *head_table = reinterpret_cast(g_platform_l2_swimlane_aicore_rotation_table); - - // Pop first buffer from free_queue for each core - for (int i = 0; i < worker_count; i++) { - L2SwimlaneAicpuTaskPool *state = get_perf_buffer_state(l2_swimlane_base, i); - L2SwimlaneAicoreTaskPool *ac_state = get_aicore_buffer_state(l2_swimlane_base, worker_count, i); - - s_aicpu_task_pools[i] = state; - s_aicore_task_pools[i] = ac_state; - - if (head_table != nullptr) { - head_table[i] = reinterpret_cast(&ac_state->head); - } - - // Pop first buffer from free_queue - rmb(); - uint32_t head = state->free_queue.head; - uint32_t tail = state->free_queue.tail; - - if (head != tail) { - uint64_t buf_ptr = state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT]; - rmb(); - state->free_queue.head = head + 1; - state->head.current_buf_ptr = buf_ptr; - state->head.current_buf_seq = 0; - wmb(); - - L2SwimlaneAicpuTaskBuffer *buf = reinterpret_cast(buf_ptr); - buf->count = 0; - s_current_aicpu_task_buffers[i] = buf; - - LOG_DEBUG("Core %d: popped initial buffer (addr=0x%lx)", i, buf_ptr); - } else { - LOG_ERROR("Core %d: free_queue is empty during init!", i); - state->head.current_buf_ptr = 0; - s_current_aicpu_task_buffers[i] = nullptr; - } - - // Prime the AICore head channel with the initial buffer. Seq starts - // at 0; AICore's local `cached_buf_seq` defaults to UINT32_MAX so the - // first record_task call observes a mismatch and loads the buffer. - rmb(); - uint32_t ac_head = ac_state->free_queue.head; - uint32_t ac_tail = ac_state->free_queue.tail; - if (ac_head != ac_tail) { - uint64_t ac_buf_ptr = ac_state->free_queue.buffer_ptrs[ac_head % PLATFORM_PROF_SLOT_COUNT]; - rmb(); - ac_state->free_queue.head = ac_head + 1; - // Same publish pattern as aicore_rotate: ptr first, then a fence, - // then seq. 
AICore lazy-resolves the head on its first task, so - // strict ordering here matters only if AICore is ever changed to - // start polling before the first dispatch — keeping the patterns - // aligned future-proofs that. - ac_state->head.current_buf_ptr = ac_buf_ptr; - wmb(); - ac_state->head.current_buf_seq = 0; - wmb(); - L2SwimlaneAicoreTaskBuffer *ac_buf = reinterpret_cast(ac_buf_ptr); - ac_buf->count = 0; - LOG_DEBUG("Core %d: primed AICore head with buf=0x%lx, seq=0", i, ac_buf_ptr); - } else { - LOG_ERROR("Core %d: AICore free_queue is empty during init!", i); - ac_state->head.current_buf_ptr = 0; - ac_state->head.current_buf_seq = 0; - wmb(); - } - } - - wmb(); - - LOG_INFO_V0("Performance profiling initialized for %d cores (with AICore rotation)", worker_count); -} - -/** - * Internal records-buffer rotation. Called from `l2_swimlane_aicpu_complete_task` - * after a record is committed and the buffer hits capacity. Only swaps an - * AICPU-private records pointer — AICore reads from a stable ring and is - * unaffected by this call. 
- */ -static void switch_records_buffer(int core_id, int thread_idx) { - L2SwimlaneAicpuTaskPool *state = s_aicpu_task_pools[core_id]; - if (state == nullptr) { - return; - } - - L2SwimlaneAicpuTaskBuffer *full_buf = s_current_aicpu_task_buffers[core_id]; - if (full_buf == nullptr) { - return; - } - - LOG_INFO_V0("Thread %d: Core %d buffer is full (count=%u)", thread_idx, core_id, full_buf->count); - - uint32_t seq = state->head.current_buf_seq; - uint64_t full_buf_ptr = state->head.current_buf_ptr; - int rc = enqueue_ready_buffer( - s_l2_swimlane_header, thread_idx, core_id, full_buf_ptr, seq, L2SwimlaneBufferKind::AicpuTask - ); - if (rc != 0) { - LOG_ERROR("Thread %d: Core %d failed to enqueue buffer (queue full), data lost!", thread_idx, core_id); - state->head.dropped_record_count = state->head.dropped_record_count + full_buf->count; - full_buf->count = 0; - wmb(); - return; - } - - uint32_t next_seq = seq + 1; - state->head.current_buf_ptr = 0; - state->head.current_buf_seq = next_seq; - s_current_aicpu_task_buffers[core_id] = nullptr; - wmb(); - - L2SwimlaneAicpuTaskBuffer *new_buf = try_pop_records_buffer(core_id, state, next_seq); - if (new_buf == nullptr) { - return; - } - - LOG_INFO_V0( - "Thread %d: Core %d switched to new buffer (addr=0x%lx)", thread_idx, core_id, - reinterpret_cast(new_buf) - ); -} - -// Try to rotate the AICore buffer for `core_id`. Called from the completion -// path after a successful L2SwimlaneAicpuTaskRecord commit so the just-FIN'd task's -// AICore record is guaranteed to be in the old buffer before we enqueue it. -// On success bumps `ac_state->head.current_buf_seq`; on failure (empty free queue -// or full ready queue) the old buffer is abandoned in place, AICore overflows -// it from now on, and the drop count grows. 
-static void aicore_rotate(int core_id, int thread_idx) { - L2SwimlaneAicoreTaskPool *ac_state = s_aicore_task_pools[core_id]; - if (ac_state == nullptr) { - return; - } - - uint64_t old_buf_ptr = ac_state->head.current_buf_ptr; - uint32_t seq = ac_state->head.current_buf_seq; - - uint32_t head = 0; - uint32_t tail = 0; - if (!wait_for_free_queue_entry(&ac_state->free_queue, &head, &tail)) { - // No replacement available — AICore continues to write into the old - // buffer; its slot counter will hit BUFFER_SIZE and the slot guard - // silently drops further records. We deliberately do NOT bump - // dropped_record_count here: AICPU has no precise view of how many - // tasks will actually fall in this gap before the run ends. The - // pre-emptive BUFFER_SIZE bump that used to live here over-counted - // when the run ended early — the old buffer's already-written - // records still flushed (counted toward `collected`), and the - // pre-emptive bump on top of that broke the - // `collected + dropped == total` reconcile invariant. The drop is - // visible at reconcile time as silent loss - // (`total - collected - dropped > 0`) and the WARN below records - // the failure mode. - LOG_WARN( - "Thread %d: Core %d AICore free_queue empty at rotation; AICore slot guard will drop overflow records", - thread_idx, core_id - ); - return; - } - - uint64_t new_buf_ptr = ac_state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT]; - rmb(); - if (new_buf_ptr == 0) { - LOG_WARN( - "Thread %d: Core %d AICore free_queue returned a null buffer at rotation; keeping old buffer active", - thread_idx, core_id - ); - return; - } - - // Enqueue the just-filled AICore buffer with count = BUFFER_SIZE. 
- if (old_buf_ptr != 0) { - L2SwimlaneAicoreTaskBuffer *old_buf = reinterpret_cast(old_buf_ptr); - old_buf->count = static_cast(PLATFORM_AICORE_BUFFER_SIZE); - wmb(); - int rc = enqueue_ready_buffer( - s_l2_swimlane_header, thread_idx, core_id, old_buf_ptr, seq, L2SwimlaneBufferKind::AicoreTask - ); - if (rc != 0) { - // Ready queue full — we leave current_buf_ptr pointing at the - // old buffer so the run-end flush path retries the enqueue (the - // host is draining concurrently; the queue may have space by - // then). We deliberately do NOT bump dropped here for the same - // reason as the empty-free-queue branch: counting a drop now - // would double-count if the flush succeeds in delivering the - // buffer to the host. Reconcile reports the actual loss as - // silent_loss when neither this rotation nor the flush - // delivers the records. - LOG_ERROR( - "Thread %d: Core %d failed to enqueue AICore buffer at rotation (queue full); will retry at flush", - thread_idx, core_id - ); - return; - } - } - - // Pop next buffer from free_queue and publish via the head channel. - // Publish order matters: AICore observes head.current_buf_seq change to - // detect rotation, then reads head.current_buf_ptr. Write ptr first so - // AICore can never see a new seq with a stale ptr. new_buf->count=0 must - // also be visible before AICore's slot writes begin. - ac_state->free_queue.head = head + 1; - L2SwimlaneAicoreTaskBuffer *new_buf = reinterpret_cast(new_buf_ptr); - new_buf->count = 0; - - wmb(); - ac_state->head.current_buf_ptr = new_buf_ptr; - wmb(); - ac_state->head.current_buf_seq = seq + 1; - wmb(); -} - -// Pre-dispatch hook. Called from the dispatch path (scheduler_dispatch in -// tensormap_and_ringbuffer; aicpu_executor in host_build_graph) immediately -// before `write_reg(DATA_MAIN_BASE)` for each AICore task. Maintains the -// per-core dispatch count and rotates the AICore buffer when the count is -// about to cross a PLATFORM_AICORE_BUFFER_SIZE boundary. 
-// -// Race safety: rotation runs before the dispatch register write. The -// completion-before-dispatch invariant (AICore per core is single-threaded -// and AICPU does not dispatch task K+1 until K FIN'd) guarantees AICore has -// already finished writing — and dcci'd out — every record in the old buffer -// by then. AICPU can safely enqueue the old buffer to the ready queue. -// -// total_record_count accounting also lives here: one AICore record == one -// dispatch, so the dispatch count IS the AICore-side total. Bumping here -// (instead of inside complete_task) means level=1 (AICORE_TIMING-only) gets -// accurate reconcile counts even when complete_task is bypassed. -void l2_swimlane_aicpu_on_aicore_dispatch(int core_id, int thread_idx) { - if (!g_enable_l2_swimlane) { - return; - } - if (core_id < 0 || core_id >= PLATFORM_MAX_CORES) { - return; - } - L2SwimlaneAicoreTaskPool *ac_state = s_aicore_task_pools[core_id]; - if (ac_state == nullptr) { - return; - } - uint32_t prev = s_aicore_dispatched_count[core_id]; - // Rotate exactly on the first dispatch of each non-initial BUFFER_SIZE - // batch (prev = BUFFER_SIZE, 2*BUFFER_SIZE, ...). PLATFORM_AICORE_BUFFER_SIZE - // is asserted power-of-two so the mod lowers to a bitwise AND. - if (prev > 0 && (prev & (PLATFORM_AICORE_BUFFER_SIZE - 1)) == 0) { - aicore_rotate(core_id, thread_idx); - } - s_aicore_dispatched_count[core_id] = prev + 1; - ac_state->head.total_record_count += 1; -} - -int l2_swimlane_aicpu_complete_task( - int core_id, int thread_idx, uint32_t reg_task_id, uint64_t dispatch_time, uint64_t finish_time -) { - if (core_id < 0 || core_id >= PLATFORM_MAX_CORES) { - return -1; - } - L2SwimlaneAicpuTaskPool *state = s_aicpu_task_pools[core_id]; - if (state == nullptr) { - return -1; - } - - // Account every commit attempt up front so host can detect silent loss as - // `device_total - (collected + dropped)`. 
- state->head.total_record_count += 1; - - L2SwimlaneAicpuTaskBuffer *l2_swimlane_buf = s_current_aicpu_task_buffers[core_id]; - if (l2_swimlane_buf == nullptr) { - l2_swimlane_buf = try_pop_records_buffer(core_id, state, state->head.current_buf_seq); - if (l2_swimlane_buf == nullptr) { - // No active records buffer (init ran out of free buffers or host has - // not refilled after the last published full buffer); count as drop - // so host reconciliation stays consistent. - state->head.dropped_record_count += 1; - return -1; - } - } - uint32_t count = l2_swimlane_buf->count; - if (count >= PLATFORM_PROF_BUFFER_SIZE) { - // Defensive: should not happen because we rotate at end of every commit. - state->head.dropped_record_count += 1; - return -1; - } - - // AICPU-only timing — three fields, two cache half-lines. Identity - // (task_token_raw, core_type) lives in the AICore record; the host - // joins by reg_task_id. See L2SwimlaneAicpuTaskRecord header comment. - L2SwimlaneAicpuTaskRecord *record = &l2_swimlane_buf->records[count]; - record->reg_task_id = reg_task_id; - record->dispatch_time = dispatch_time; - record->finish_time = finish_time; - - uint32_t new_count = count + 1; - l2_swimlane_buf->count = new_count; - wmb(); - - // Rotate AICpu's L2SwimlaneAicpuTaskBuffer after the write so the just-committed - // record is preserved. - if (new_count >= PLATFORM_PROF_BUFFER_SIZE) { - switch_records_buffer(core_id, thread_idx); - } - - // AICore-pool stats (total_record_count) are bumped on the dispatch side, - // not here. See l2_swimlane_aicpu_on_aicore_dispatch — counting per - // dispatch keeps reconcile counts accurate even at level=1 where this - // function never runs. 
- return 0; -} - -void l2_swimlane_aicpu_flush(int thread_idx, const int *cur_thread_cores, int core_num) { - if (!g_enable_l2_swimlane) { - return; - } - - void *l2_swimlane_base = reinterpret_cast<void *>(g_platform_l2_swimlane_base); - if (l2_swimlane_base == nullptr) { - return; - } - - rmb(); - - LOG_INFO_V0("Thread %d: Flushing performance buffers for %d cores", thread_idx, core_num); - - int flushed_count = 0; - - for (int i = 0; i < core_num; i++) { - int core_id = cur_thread_cores[i]; - L2SwimlaneAicpuTaskPool *state = s_aicpu_task_pools[core_id]; - if (state == nullptr) continue; - - rmb(); - uint64_t buf_ptr = state->head.current_buf_ptr; - if (buf_ptr == 0) { - // No active buffer - } else { - L2SwimlaneAicpuTaskBuffer *buf = reinterpret_cast<L2SwimlaneAicpuTaskBuffer *>(buf_ptr); - if (buf->count > 0) { - uint32_t seq = state->head.current_buf_seq; - int rc = enqueue_ready_buffer( - s_l2_swimlane_header, thread_idx, core_id, buf_ptr, seq, L2SwimlaneBufferKind::AicpuTask - ); - if (rc == 0) { - LOG_INFO_V0("Thread %d: Core %d flushed buffer with %u records", thread_idx, core_id, buf->count); - flushed_count++; - state->head.current_buf_ptr = 0; - s_current_aicpu_task_buffers[core_id] = nullptr; - wmb(); - } else { - // ready_queue full at end-of-run: account the loss and clear the - // buffer so host reconcile sees a clean state (current_buf_ptr=0) - // and dropped == flush failures rather than ring/task_id mismatch. - LOG_ERROR( - "Thread %d: Core %d failed to enqueue buffer (queue full), %u records lost!", thread_idx, - core_id, buf->count - ); - state->head.dropped_record_count = state->head.dropped_record_count + buf->count; - buf->count = 0; - state->head.current_buf_ptr = 0; - s_current_aicpu_task_buffers[core_id] = nullptr; - wmb(); - } - } - } - - // Also flush the current AICore buffer to the ready queue so the host - // sees this session's final batch of AICore timestamps. 
- // - // High-water mark uses the rotation accounting (total_record_count - - // current_buf_seq * BUFFER_SIZE). total_record_count is bumped per - // dispatch in l2_swimlane_aicpu_on_aicore_dispatch and is therefore - // accurate at all levels — including level=1 where complete_task is - // bypassed. The formula clamps to BUFFER_SIZE if an earlier rotation - // failed (no free buffer), so we never stamp a partial count when - // the buffer is actually full. - L2SwimlaneAicoreTaskPool *ac_state = s_aicore_task_pools[core_id]; - if (ac_state == nullptr) continue; - - rmb(); - uint64_t ac_buf_ptr = ac_state->head.current_buf_ptr; - if (ac_buf_ptr == 0) continue; - - // At AICPU_TIMING+, `total_record_count` is bumped on every complete - // and gives an accurate live count for the current buffer. At - // AICORE_TIMING (level=1) complete_task is skipped, so that counter - // stays 0 and the formula bails even when AICore has filled records. - // Fall back to the buffer's full capacity in that case; the host-side - // copy_aicore_buffer skips trailing slots whose start_time is still 0, - // so over-stating count costs only a scan pass — never spurious records. - uint32_t ac_mark; - if (g_l2_swimlane_level >= L2SwimlaneLevel::AICPU_TIMING) { - uint32_t live = ac_state->head.total_record_count - - ac_state->head.current_buf_seq * static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE); - if (live == 0) { - continue; - } - ac_mark = (live > static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE)) ? 
- static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE) : - live; - } else { - ac_mark = static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE); - } - L2SwimlaneAicoreTaskBuffer *ac_buf = reinterpret_cast<L2SwimlaneAicoreTaskBuffer *>(ac_buf_ptr); - ac_buf->count = ac_mark; - wmb(); - - uint32_t ac_seq = ac_state->head.current_buf_seq; - int rc = enqueue_ready_buffer( - s_l2_swimlane_header, thread_idx, core_id, ac_buf_ptr, ac_seq, L2SwimlaneBufferKind::AicoreTask - ); - if (rc == 0) { - LOG_INFO_V0( - "Thread %d: Core %d flushed AICore buffer (seq=%u, count=%u)", thread_idx, core_id, ac_seq, ac_mark - ); - ac_state->head.current_buf_ptr = 0; - wmb(); - } else { - LOG_ERROR("Thread %d: Core %d failed to enqueue AICore buffer at flush (queue full)", thread_idx, core_id); - ac_state->head.dropped_record_count = ac_state->head.dropped_record_count + ac_mark; - ac_state->head.current_buf_ptr = 0; - wmb(); - } - } - - wmb(); - - LOG_INFO_V0("Thread %d: Performance buffer flush complete, %d buffers flushed", thread_idx, flushed_count); -} - -// Pop the first buffer from a pool's free_queue and cache it as the current -// active buffer. Shared init helper for sched and orch phase pool priming. -// Returns the popped buffer ptr (nullptr if free_queue was empty). 
-template -static Buffer *prime_phase_pool(L2SwimlaneAicpuTaskPool *state, int thread_idx, const char *kind_label) { - rmb(); - uint32_t head = state->free_queue.head; - uint32_t tail = state->free_queue.tail; - - if (head != tail) { - uint64_t buf_ptr = state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT]; - rmb(); - state->free_queue.head = head + 1; - state->head.current_buf_ptr = buf_ptr; - state->head.current_buf_seq = 0; - wmb(); - - auto *buf = reinterpret_cast(buf_ptr); - buf->count = 0; - LOG_DEBUG("Thread %d: popped initial %s phase buffer (addr=0x%lx)", thread_idx, kind_label, buf_ptr); - return buf; - } - LOG_ERROR("Thread %d: %s phase free_queue is empty during init!", thread_idx, kind_label); - state->head.current_buf_ptr = 0; - return nullptr; -} - -void l2_swimlane_aicpu_init_phase(int worker_count, int num_sched_phase_threads, int num_orch_phase_threads) { - void *l2_swimlane_base = reinterpret_cast(g_platform_l2_swimlane_base); - if (l2_swimlane_base == nullptr) { - LOG_ERROR("l2_swimlane_data_base is NULL, cannot initialize phase profiling"); - return; - } - - s_l2_swimlane_header = get_l2_swimlane_header(l2_swimlane_base); - - s_l2_swimlane_header->num_sched_phase_threads = static_cast(num_sched_phase_threads); - s_l2_swimlane_header->num_orch_phase_threads = static_cast(num_orch_phase_threads); - s_l2_swimlane_header->num_phase_cores = 0; - memset(s_l2_swimlane_header->core_to_thread, -1, sizeof(s_l2_swimlane_header->core_to_thread)); - s_phase_initialized = true; - - int sched_n = num_sched_phase_threads; - if (sched_n > PLATFORM_MAX_AICPU_THREADS) sched_n = PLATFORM_MAX_AICPU_THREADS; - int orch_n = num_orch_phase_threads; - if (orch_n > PLATFORM_MAX_AICPU_THREADS) orch_n = PLATFORM_MAX_AICPU_THREADS; - - for (int t = 0; t < sched_n; t++) { - auto *state = get_sched_phase_buffer_state(l2_swimlane_base, worker_count, t); - s_sched_phase_pools[t] = state; - s_current_sched_phase_buffers[t] = prime_phase_pool(state, t, "sched"); - } - 
for (int t = sched_n; t < PLATFORM_MAX_AICPU_THREADS; t++) { - s_sched_phase_pools[t] = nullptr; - s_current_sched_phase_buffers[t] = nullptr; - } - - for (int t = 0; t < orch_n; t++) { - auto *state = get_orch_phase_buffer_state(l2_swimlane_base, worker_count, t); - s_orch_phase_pools[t] = state; - s_current_orch_phase_buffers[t] = prime_phase_pool(state, t, "orch"); - } - for (int t = orch_n; t < PLATFORM_MAX_AICPU_THREADS; t++) { - s_orch_phase_pools[t] = nullptr; - s_current_orch_phase_buffers[t] = nullptr; - } - - wmb(); - - LOG_INFO_V0( - "Phase profiling initialized: %d sched threads, %d orch threads, %d records/thread", num_sched_phase_threads, - num_orch_phase_threads, PLATFORM_PHASE_RECORDS_PER_THREAD - ); -} - -// Generic phase-buffer switch. Enqueue the full buffer to its thread's -// ready queue under `kind`, then pop a fresh buffer from free_queue. Sets -// `*current_buf_out` to nullptr if no free buffer is available — subsequent -// records on that thread will drop until the host catches up. -// `thread_idx` is the AICPU thread doing the enqueue (always the caller); it -// selects that thread's own SPSC ready queue, which it must own exclusively. -// `pool_idx` is the pool ordinal the host uses to file records and recycle the -// buffer to that pool (the same ordinal indexes the output lane). For sched -// pools the two coincide (thread t → queue t, pool t); for the single orch -// instance they differ (orchestrator's thread, but pool ordinal 0). 
-template -static void switch_phase_buffer_kind( - int thread_idx, uint32_t pool_idx, L2SwimlaneAicpuTaskPool *state, Buffer **current_buf_out, - L2SwimlaneBufferKind kind, const char *kind_label -) { - Buffer *full_buf = *current_buf_out; - if (state == nullptr || full_buf == nullptr) return; - - LOG_INFO_V0("Thread %d: %s phase buffer is full (count=%u)", thread_idx, kind_label, full_buf->count); - - uint32_t seq = state->head.current_buf_seq; - int rc = enqueue_ready_buffer(s_l2_swimlane_header, thread_idx, pool_idx, state->head.current_buf_ptr, seq, kind); - if (rc != 0) { - LOG_ERROR( - "Thread %d: failed to enqueue %s phase buffer (queue full), %u records lost!", thread_idx, kind_label, - full_buf->count - ); - state->head.dropped_record_count += full_buf->count; - full_buf->count = 0; - wmb(); - return; - } - - uint32_t head = 0; - uint32_t tail = 0; - if (wait_for_free_queue_entry(&state->free_queue, &head, &tail)) { - uint64_t new_buf_ptr = state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT]; - rmb(); - state->free_queue.head = head + 1; - if (new_buf_ptr == 0) { - *current_buf_out = nullptr; - state->head.current_buf_ptr = 0; - wmb(); - return; - } - state->head.current_buf_ptr = new_buf_ptr; - state->head.current_buf_seq = seq + 1; - wmb(); - - Buffer *new_buf = reinterpret_cast(new_buf_ptr); - new_buf->count = 0; - *current_buf_out = new_buf; - LOG_INFO_V0("Thread %d: switched to new %s phase buffer", thread_idx, kind_label); - } else { - LOG_WARN( - "Thread %d: no free %s phase buffer available, dropping records until Host catches up", thread_idx, - kind_label - ); - *current_buf_out = nullptr; - state->head.current_buf_ptr = 0; - wmb(); - } -} - -// Acquire a writable slot in the per-thread phase buffer. Handles the -// nullptr-recover path (a prior switch couldn't pop a free buffer) and the -// buffer-full → switch path. Returns nullptr if the record must be dropped; -// callers should bump `dropped_record_count` and return when nullptr. 
-template -static Record *acquire_phase_slot( - int thread_idx, uint32_t pool_idx, L2SwimlaneAicpuTaskPool *state, Buffer **current_buf_out, - L2SwimlaneBufferKind kind, const char *kind_label -) { - Buffer *buf = *current_buf_out; - if (buf == nullptr) { - uint32_t head = 0; - uint32_t tail = 0; - if (wait_for_free_queue_entry(&state->free_queue, &head, &tail)) { - uint64_t buf_ptr = state->free_queue.buffer_ptrs[head % PLATFORM_PROF_SLOT_COUNT]; - rmb(); - state->free_queue.head = head + 1; - if (buf_ptr == 0) { - return nullptr; - } - state->head.current_buf_ptr = buf_ptr; - state->head.current_buf_seq += 1; - wmb(); - buf = reinterpret_cast(buf_ptr); - buf->count = 0; - *current_buf_out = buf; - LOG_INFO_V0("Thread %d: recovered %s phase buffer", thread_idx, kind_label); - } - if (buf == nullptr) return nullptr; - } - - uint32_t idx = buf->count; - if (idx >= PLATFORM_PHASE_RECORDS_PER_THREAD) { - switch_phase_buffer_kind(thread_idx, pool_idx, state, current_buf_out, kind, kind_label); - buf = *current_buf_out; - if (buf == nullptr) return nullptr; - idx = buf->count; - if (idx >= PLATFORM_PHASE_RECORDS_PER_THREAD) return nullptr; - } - Record *record = &buf->records[idx]; - buf->count = idx + 1; - return record; -} - -void l2_swimlane_aicpu_record_sched_phase( - int thread_idx, L2SwimlaneSchedPhaseKind kind, uint64_t start_time, uint64_t end_time, uint32_t loop_iter, - uint32_t tasks_processed, uint32_t pop_hit, uint32_t pop_miss, const int16_t *shared_at_start, - const int16_t *shared_at_end -) { - if (!s_phase_initialized) return; - auto *state = s_sched_phase_pools[thread_idx]; - if (state == nullptr) return; - - state->head.total_record_count += 1; - - auto *record = acquire_phase_slot( - /*thread_idx=*/thread_idx, /*pool_idx=*/static_cast(thread_idx), state, - &s_current_sched_phase_buffers[thread_idx], L2SwimlaneBufferKind::AicpuSchedPhase, "sched" - ); - if (record == nullptr) { - state->head.dropped_record_count += 1; - return; - } - record->start_time 
= start_time; - record->end_time = end_time; - record->loop_iter = loop_iter; - record->kind = kind; - record->tasks_processed = tasks_processed; - record->pop_hit = pop_hit; - record->pop_miss = pop_miss; - auto copy_snapshot = [](int16_t dst[L2SWIMLANE_NUM_QUEUE_SHAPES], const int16_t *src) { - if (src == nullptr) { - for (int i = 0; i < L2SWIMLANE_NUM_QUEUE_SHAPES; i++) - dst[i] = 0; - } else { - for (int i = 0; i < L2SWIMLANE_NUM_QUEUE_SHAPES; i++) - dst[i] = src[i]; - } - }; - copy_snapshot(record->shared_depth_at_start, shared_at_start); - copy_snapshot(record->shared_depth_at_end, shared_at_end); -} - -void l2_swimlane_aicpu_set_orch_thread_idx(int thread_idx) { s_orch_thread_idx = thread_idx; } - -void l2_swimlane_aicpu_record_orch_phase( - uint64_t start_time, uint64_t end_time, uint64_t task_id, uint32_t submit_idx -) { - if (s_orch_thread_idx < 0 || !s_phase_initialized) return; - // Single orch instance (dep_gen / scope_stats style): all orch records - // funnel into pool ordinal 0, regardless of which AICPU thread the - // orchestrator runs on. s_orch_thread_idx is the orchestrator's AICPU - // thread index — used only to pick its own ready queue (SPSC owner); the - // entry is tagged with pool ordinal 0 so the host files it into orch lane 0. - auto *state = s_orch_phase_pools[0]; - if (state == nullptr) return; - - state->head.total_record_count += 1; - - auto *record = acquire_phase_slot( - /*thread_idx=*/s_orch_thread_idx, /*pool_idx=*/0, state, &s_current_orch_phase_buffers[0], - L2SwimlaneBufferKind::AicpuOrchPhase, "orch" - ); - if (record == nullptr) { - state->head.dropped_record_count += 1; - return; - } - record->start_time = start_time; - record->end_time = end_time; - record->task_id = task_id; - record->submit_idx = submit_idx; -} - -// Final-drain flush of one phase pool's active buffer. `thread_idx` / `pool_idx` -// as in switch_phase_buffer_kind. 
-static void flush_phase_pool( - int thread_idx, uint32_t pool_idx, L2SwimlaneAicpuTaskPool *state, L2SwimlaneBufferKind kind, const char *kind_label -) { - if (state == nullptr) return; - rmb(); - uint64_t buf_ptr = state->head.current_buf_ptr; - if (buf_ptr == 0) return; - // `count` sits AFTER the records[] array in TypedBuffer, so its byte offset - // is N * sizeof(Record) — different for sched (40B) vs orch (32B) records. - // Read/write it through the matching buffer type; a single fixed cast reads - // past the orch buffer, sees 0, and silently skips the orch flush. - volatile uint32_t *count_ptr = (kind == L2SwimlaneBufferKind::AicpuOrchPhase) ? - &reinterpret_cast(buf_ptr)->count : - &reinterpret_cast(buf_ptr)->count; - if (*count_ptr == 0) return; - uint32_t seq = state->head.current_buf_seq; - int rc = enqueue_ready_buffer(s_l2_swimlane_header, thread_idx, pool_idx, buf_ptr, seq, kind); - if (rc == 0) { - LOG_INFO_V0("Thread %d: flushed %s phase buffer with %u records", thread_idx, kind_label, *count_ptr); - } else { - LOG_ERROR( - "Thread %d: failed to enqueue %s phase buffer (queue full), %u records lost!", thread_idx, kind_label, - *count_ptr - ); - state->head.dropped_record_count += *count_ptr; - *count_ptr = 0; - } - state->head.current_buf_ptr = 0; - wmb(); -} - -// Final-drain flush of the scheduler-phase pool owned by this scheduler thread. -void l2_swimlane_aicpu_flush_sched_phase_buffer(int thread_idx) { - if (!s_phase_initialized || s_l2_swimlane_header == nullptr) return; - flush_phase_pool( - thread_idx, static_cast(thread_idx), s_sched_phase_pools[thread_idx], - L2SwimlaneBufferKind::AicpuSchedPhase, "sched" - ); - s_current_sched_phase_buffers[thread_idx] = nullptr; -} - -// Final-drain flush of the single orchestrator's orch-phase pool (ordinal 0). -// Called once by the orchestrator thread at orchestration end; see -// record_orch_phase for the pool-0 / own-ready-queue tagging. 
-void l2_swimlane_aicpu_flush_orch_phase_buffer(int thread_idx) { - if (!s_phase_initialized || s_l2_swimlane_header == nullptr) return; - flush_phase_pool(thread_idx, /*pool_idx=*/0, s_orch_phase_pools[0], L2SwimlaneBufferKind::AicpuOrchPhase, "orch"); - s_current_orch_phase_buffers[0] = nullptr; -} - -void l2_swimlane_aicpu_init_core_assignments(int total_cores) { - if (!s_phase_initialized) { - return; - } - memset(s_l2_swimlane_header->core_to_thread, -1, sizeof(s_l2_swimlane_header->core_to_thread)); - s_l2_swimlane_header->num_phase_cores = static_cast(total_cores); - wmb(); - LOG_INFO_V0("Core-to-thread mapping init: %d cores", total_cores); -} - -void l2_swimlane_aicpu_write_core_assignments_for_thread(int thread_idx, const int *core_ids, int core_num) { - if (!s_phase_initialized) { - return; - } - for (int i = 0; i < core_num; i++) { - int core_id = core_ids[i]; - if (core_id >= 0 && core_id < PLATFORM_MAX_CORES) { - s_l2_swimlane_header->core_to_thread[core_id] = static_cast(thread_idx); - } - } - wmb(); -} diff --git a/src/a5/platform/sim/host/CMakeLists.txt b/src/a5/platform/sim/host/CMakeLists.txt index 396ef1edf..6a5ca9e3a 100644 --- a/src/a5/platform/sim/host/CMakeLists.txt +++ b/src/a5/platform/sim/host/CMakeLists.txt @@ -45,9 +45,9 @@ list(APPEND HOST_RUNTIME_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/profiling_copy.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/pto_runtime_c_api.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/platform_compile_info.cpp" - "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/l2_swimlane_collector.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/l2_swimlane_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/pmu_collector.cpp" - "${CMAKE_CURRENT_SOURCE_DIR}/../../shared/host/dep_gen_collector.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/dep_gen_collector.cpp" 
"${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/scope_stats_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/shared/host/tensor_dump_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform/sim/aicpu/platform_aicpu_affinity.cpp" diff --git a/src/a5/platform/include/host/dep_gen_collector.h b/src/common/platform/include/host/dep_gen_collector.h similarity index 94% rename from src/a5/platform/include/host/dep_gen_collector.h rename to src/common/platform/include/host/dep_gen_collector.h index 6b8f8cfb8..dddab9cd4 100644 --- a/src/a5/platform/include/host/dep_gen_collector.h +++ b/src/common/platform/include/host/dep_gen_collector.h @@ -11,16 +11,16 @@ /** * @file dep_gen_collector.h - * @brief Host-side dep_gen (SubmitTrace) buffer allocation, streaming - * collection, and raw binary export. + * @brief Host-side dep_gen (SubmitTrace) buffer allocation and streaming + * collection for in-memory replay. * * Architecture: * - BufferPoolManager: shared mgmt-thread infrastructure that * polls per-thread ready queues, drains done-queue shards, and replenishes * the single instance's free_queue from a unified recycled pool. * - DepGenCollector: collector thread shards pop full DepGenBuffers from the - * manager and append their DepGenRecords to a binary file - * (submit_trace.bin). + * manager and append their DepGenRecords to an in-memory vector consumed by + * host replay after device execution completes. * * Lifecycle: * init() — Allocate header + 1 BufferState + N DepGenBuffers @@ -36,14 +36,13 @@ * (incomplete graph; user gets a warning). * finalize() — Free all device memory, unregister. * - * Output format (submit_trace.bin): a fixed-size header followed by a - * contiguous stream of DepGenRecord values. Replay (future PR) reads this - * back. Layout intentionally trivial (no varint / framing) so the - * `sizeof(DepGenRecord)` ABI in `common/dep_gen.h` is the only contract. 
+ * Output contract: a contiguous in-memory stream of DepGenRecord values. + * Host replay consumes this stream directly; no submit_trace.bin intermediary + * is written by the collector. */ -#ifndef SRC_A5_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_ -#define SRC_A5_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_ +#ifndef SRC_COMMON_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_ +#define SRC_COMMON_PLATFORM_INCLUDE_HOST_DEP_GEN_COLLECTOR_H_ #include #include @@ -180,7 +179,7 @@ class DepGenCollector : public profiling_common::ProfilerBase #include @@ -325,7 +325,7 @@ class L2SwimlaneCollector : public profiling_common::ProfilerBase guard(manager_, free_cb); @@ -112,14 +113,14 @@ int L2SwimlaneCollector::initialize( LOG_DEBUG(" Total shared memory: %zu bytes (%zu KB)", total_size, total_size / 1024); // Step 2: Allocate the shared-memory region (header + SPSC slot arrays) - // via the base allocator. On a5 there is no halHostRegister and the - // device HBM region is not host-addressable, so alloc_paired_buffer - // mallocs a host shadow and seeds the device copy (the shadow path is - // selected by the copy_to_device callback installed in set_memory_context - // above). The host initializes the region through perf_host_ptr below, - // and a single profiling_copy_to_device at the end of init pushes the - // primed state to the device. Writing perf_host_ptr directly to the raw - // device pointer here would SIGSEGV — see set_memory_context above. + // via the base allocator. Non-SVM platforms do not expose device HBM as + // host-addressable memory, so alloc_paired_buffer mallocs a host shadow and + // seeds the device copy (the shadow path is selected by the copy_to_device + // callback installed in set_memory_context above). The host initializes the + // region through perf_host_ptr below, and a single profiling_copy_to_device + // at the end of init pushes the primed state to the device. 
Writing + // perf_host_ptr directly to the raw device pointer there would SIGSEGV — + // see set_memory_context above. void *perf_host_ptr = nullptr; void *perf_dev_ptr = alloc_paired_buffer(total_size, &perf_host_ptr); if (perf_dev_ptr == nullptr) { @@ -170,7 +171,9 @@ int L2SwimlaneCollector::initialize( LOG_DEBUG(" buffer_capacity: %d", PLATFORM_PROF_BUFFER_SIZE); LOG_DEBUG(" queue capacity: %d", PLATFORM_PROF_READYQUEUE_SIZE); - // Step 5: Initialize L2SwimlaneAicpuTaskPools — 1 buffer per core in free_queue, rest to recycled pool + // Step 5: Initialize L2SwimlaneAicpuTaskPools. Seed as many buffers as + // the device-side free_queue can hold; any remaining buffers stay in the + // host recycled pool. for (int i = 0; i < num_aicore; i++) { L2SwimlaneAicpuTaskPool *state = get_perf_buffer_state(perf_host_ptr, i); memset(state, 0, sizeof(L2SwimlaneAicpuTaskPool)); @@ -180,6 +183,9 @@ int L2SwimlaneCollector::initialize( state->head.current_buf_ptr = 0; state->head.current_buf_seq = 0; + const int initial_free_count = (PLATFORM_PROF_BUFFERS_PER_CORE < PLATFORM_PROF_SLOT_COUNT) ? 
+ PLATFORM_PROF_BUFFERS_PER_CORE : + PLATFORM_PROF_SLOT_COUNT; for (int s = 0; s < PLATFORM_PROF_BUFFERS_PER_CORE; s++) { void *host_buf_ptr = nullptr; void *dev_buf_ptr = alloc_paired_buffer(sizeof(L2SwimlaneAicpuTaskBuffer), &host_buf_ptr); @@ -191,14 +197,14 @@ int L2SwimlaneCollector::initialize( memset(buf, 0, sizeof(L2SwimlaneAicpuTaskBuffer)); buf->count = 0; - if (s == 0) { - state->free_queue.buffer_ptrs[0] = reinterpret_cast(dev_buf_ptr); + if (s < initial_free_count) { + state->free_queue.buffer_ptrs[s] = reinterpret_cast(dev_buf_ptr); } else { manager_.push_recycled(static_cast(ProfBufferType::AICPU_TASK), dev_buf_ptr); } } wmb(); - state->free_queue.tail = 1; + state->free_queue.tail = static_cast(initial_free_count); wmb(); } @@ -208,6 +214,9 @@ int L2SwimlaneCollector::initialize( L2SwimlaneAicoreTaskPool *ac_state = get_aicore_buffer_state(perf_host_ptr, num_aicore, i); memset(ac_state, 0, sizeof(L2SwimlaneAicoreTaskPool)); + const int initial_free_count = (PLATFORM_AICORE_BUFFERS_PER_CORE < PLATFORM_PROF_SLOT_COUNT) ? 
+ PLATFORM_AICORE_BUFFERS_PER_CORE : + PLATFORM_PROF_SLOT_COUNT; for (int s = 0; s < PLATFORM_AICORE_BUFFERS_PER_CORE; s++) { void *host_buf_ptr = nullptr; void *dev_buf_ptr = alloc_paired_buffer(sizeof(L2SwimlaneAicoreTaskBuffer), &host_buf_ptr); @@ -219,20 +228,19 @@ int L2SwimlaneCollector::initialize( memset(buf, 0, sizeof(L2SwimlaneAicoreTaskBuffer)); buf->count = 0; - if (s == 0) { - ac_state->free_queue.buffer_ptrs[0] = reinterpret_cast(dev_buf_ptr); + if (s < initial_free_count) { + ac_state->free_queue.buffer_ptrs[s] = reinterpret_cast(dev_buf_ptr); } else { manager_.push_recycled(static_cast(ProfBufferType::AICORE_TASK), dev_buf_ptr); } } wmb(); - ac_state->free_queue.tail = 1; + ac_state->free_queue.tail = static_cast(initial_free_count); wmb(); } LOG_DEBUG( - "Initialized buffer pools: %d L2SwimlaneAicpuTaskBuffers/core + %d L2SwimlaneAicoreTaskBuffers/core (1 in " - "free_queue, " - "rest in recycled pool)", + "Initialized buffer pools: %d L2SwimlaneAicpuTaskBuffers/core + %d L2SwimlaneAicoreTaskBuffers/core " + "(seeded up to PLATFORM_PROF_SLOT_COUNT free_queue slots, rest in recycled pool)", PLATFORM_PROF_BUFFERS_PER_CORE, PLATFORM_AICORE_BUFFERS_PER_CORE ); @@ -260,7 +268,8 @@ int L2SwimlaneCollector::initialize( // Step 6: Initialize per-thread phase pools — both sched and orch. Each // pool is sized to its own PLATFORM_PROF_{SCHED,ORCH}_BUFFERS_PER_THREAD - // (1 in free_queue, rest in the recycled pool tagged by kind). Templated on the + // (up to PLATFORM_PROF_SLOT_COUNT in free_queue, rest in the recycled pool + // tagged by kind). 
Templated on the
     // concrete TypedBuffer so the `count` zero-store uses the matching layout
     // — sched and orch buffers have DIFFERENT sizes (64B vs 32B records),
     // so a single cast type for both would land the count store past the end
@@ -279,6 +288,8 @@ int L2SwimlaneCollector::initialize(
         auto *state = get_state(perf_host_ptr, num_aicore, t);
         memset(state, 0, sizeof(L2SwimlaneAicpuTaskPool));
         if (t >= buffer_count) continue;  // zeroed state only; no buffers (unused slot)
+        const int initial_free_count =
+            (buffers_per_thread < PLATFORM_PROF_SLOT_COUNT) ? buffers_per_thread : PLATFORM_PROF_SLOT_COUNT;
         for (int s = 0; s < buffers_per_thread; s++) {
             void *host_buf_ptr = nullptr;
             void *dev_buf_ptr = alloc_paired_buffer(buffer_bytes, &host_buf_ptr);
@@ -290,14 +301,14 @@ int L2SwimlaneCollector::initialize(
             // matching Buffer type. The records payload is overwritten by
             // AICPU on first use.
             reinterpret_cast(host_buf_ptr)->count = 0;
-            if (s == 0) {
-                state->free_queue.buffer_ptrs[0] = reinterpret_cast(dev_buf_ptr);
+            if (s < initial_free_count) {
+                state->free_queue.buffer_ptrs[s] = reinterpret_cast(dev_buf_ptr);
             } else {
                 manager_.push_recycled(static_cast(recycle_kind), dev_buf_ptr);
             }
         }
         wmb();
-        state->free_queue.tail = 1;
+        state->free_queue.tail = static_cast(initial_free_count);
         wmb();
     }
     return 0;
@@ -340,7 +351,7 @@ int L2SwimlaneCollector::initialize(
     wmb();
 
     // Push the host-initialized region (header + every pool's primed
-    // free_queue tail/buffer_ptrs[0]) down to the device. perf_host_ptr is a
+    // free_queue tail/buffer_ptrs[]) down to the device. perf_host_ptr is a
     // malloc'd shadow distinct from the device HBM region, so without this the
     // device never sees the primed free queues and AICPU/AICore read zeros.
     // The mgmt-loop mirror is read-only (device→host) and never re-pushes this
@@ -372,7 +383,7 @@ int L2SwimlaneCollector::initialize(
     perf_shared_mem_dev_ = perf_dev_ptr;
     aicore_ring_addr_table_dev_ = rotation_table_dev;
     set_memory_context(
-        alloc_cb, register_cb, free_cb, profiling_copy_to_device_for_ops, profiling_copy_from_device_for_ops,
+        alloc_cb, register_cb, free_cb, profiling_copy_to_device_or_null(), profiling_copy_from_device_or_null(),
         perf_dev_ptr, perf_host_ptr, total_size, device_id
     );
     return 0;
@@ -959,9 +970,9 @@ int L2SwimlaneCollector::finalize(L2SwimlaneUnregisterCallback unregister_cb, co
     // Every release site below goes through release_one_buffer so an
     // optional halHostRegister unregister and the free stay an inseparable
     // pair — each dev_ptr a register_cb mapped is unregistered before its
-    // device memory is freed. On a5 register_cb is null (no halHostRegister)
-    // so the unregister branch is a no-op and only the device free runs; the
-    // paired host shadows are reclaimed separately by clear_mappings() below.
+    // device memory is freed. On non-SVM platforms register_cb is null, so the
+    // unregister branch is a no-op and only the device free runs; the paired
+    // host shadows are reclaimed separately by clear_mappings() below.
     // The pairing matters on a2a3, where leaking HAL registrations across
     // init_l2_swimlane() invocations makes back-to-back tests on a reused
     // Worker fail at rc=8 from halHostRegister.
@@ -1043,7 +1054,7 @@ int L2SwimlaneCollector::finalize(L2SwimlaneUnregisterCallback unregister_cb, co
     // Free any malloc'd host shadows still tracked in the manager's
     // malloc_shadows_ — the shm region, rotation table, and per-pool buffers
     // were freed above via release_one_buffer (device pointer only), so their
-    // paired shadows (allocated by alloc_paired_buffer on a5's no-SVM path)
+    // paired shadows (allocated by alloc_paired_buffer on the non-SVM path)
     // never went through release_owned_buffers. clear_mappings() std::free's
     // them. No-op on SVM (host_ptr == dev_ptr, nothing in malloc_shadows_).
     // Matches PMU / DepGen finalize.
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py b/tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
index 7377b545c..ef1233128 100644
--- a/tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
@@ -53,7 +53,7 @@ def _task_id(ring: int, local: int) -> int:
 
 @scene_test(level=2, runtime="tensormap_and_ringbuffer")
 class TestDepGen(SceneTestCase):
-    """Vector example, run with dep_gen enabled, then verify submit_trace.bin."""
+    """Vector example, run with dep_gen enabled, then verify generated deps."""
 
     CALLABLE = {
         "orchestration": {
diff --git a/tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py b/tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
index 3b621e749..04777d9a0 100644
--- a/tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
+++ b/tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
@@ -53,7 +53,7 @@ def _task_id(ring: int, local: int) -> int:
 
 @scene_test(level=2, runtime="tensormap_and_ringbuffer")
 class TestDepGen(SceneTestCase):
-    """Vector example, run with dep_gen enabled, then verify submit_trace.bin."""
+    """Vector example, run with dep_gen enabled, then verify generated deps."""
 
     CALLABLE = {
         "orchestration": {
diff --git a/tests/ut/cpp/CMakeLists.txt b/tests/ut/cpp/CMakeLists.txt
index 0c868a45f..1a565382c 100644
--- a/tests/ut/cpp/CMakeLists.txt
+++ b/tests/ut/cpp/CMakeLists.txt
@@ -470,7 +470,7 @@ add_executable(test_l3_l2_orch_comm_sim_runner
     ${CMAKE_SOURCE_DIR}/../../../src/common/platform/sim/sim_context/cpu_sim_context.cpp
     ${CMAKE_SOURCE_DIR}/../../../src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/runtime.cpp
     ${CMAKE_SOURCE_DIR}/../../../src/a2a3/platform/sim/host/profiling_copy.cpp
-    ${CMAKE_SOURCE_DIR}/../../../src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
+    ${CMAKE_SOURCE_DIR}/../../../src/common/platform/shared/host/l2_swimlane_collector.cpp
     ${CMAKE_SOURCE_DIR}/../../../src/a2a3/platform/shared/host/pmu_collector.cpp
     ${CMAKE_SOURCE_DIR}/../../../src/common/platform/shared/host/tensor_dump_collector.cpp
     ${CMAKE_SOURCE_DIR}/../../../src/common/platform/shared/host/scope_stats_collector.cpp
@@ -719,7 +719,7 @@ if(SIMPLER_ENABLE_HARDWARE_TESTS)
         ${PROJECT_ROOT}/src/common/platform/shared/host/tensor_dump_collector.cpp
         ${PROJECT_ROOT}/src/a5/platform/onboard/host/profiling_copy.cpp
         ${PROJECT_ROOT}/src/a5/platform/onboard/host/memory_allocator.cpp
-        ${PROJECT_ROOT}/src/a5/platform/shared/host/l2_swimlane_collector.cpp
+        ${PROJECT_ROOT}/src/common/platform/shared/host/l2_swimlane_collector.cpp
         ${PROJECT_ROOT}/src/a5/platform/shared/host/pmu_collector.cpp
     )
     target_compile_options(test_l3_l2_orch_comm_onboard_runner PRIVATE