4 changes: 2 additions & 2 deletions docs/dfx/dep_gen.md
@@ -358,8 +358,8 @@ list; only the dep_gen replay graph loses the tail.
| Layer | File | Role |
| ----- | ---- | ---- |
| Shared-mem layout | `src/{a2a3,a5}/platform/include/common/dep_gen.h` | `DepGenRecord` (4672 B base, cache-line aligned, ≤64 inline explicit_deps, per-task `block_num`) + `DepGenOverflowRecord` chain view (≤582 deps per slot) + SPSC ring + per-thread ready queue. Byte-identical layout across platforms. |
-| AICPU writer | `src/{a2a3,a5}/platform/{include,shared}/aicpu/dep_gen_collector_aicpu.{h,cpp}` | Single-instance write path; weak-fallback exported to host build. a5 reuses the a2a3 source verbatim — the writer accesses its own device-side view of shared memory, independent of how host↔device transport is implemented. |
-| Host collector | `src/{a2a3,a5}/platform/{include/host,shared/host}/dep_gen_collector.{h,cpp}` | `ProfilerBase<DepGenCollector, DepGenModule>` — drains ring → `records_` vector. On a5 (no SVM) it uses the base `alloc_paired_buffer`, which malloc's a host shadow + `copy_to_device`'s it and registers it via `add_malloc_shadow` so teardown can free it; `reconcile_counters` explicitly `copy_from_device`'s the BufferState before reading, and `finalize` lets `BufferPoolManager::clear_mappings()` release all shadows as the single source of truth. |
+| AICPU writer | `src/{a2a3,a5}/platform/include/aicpu/dep_gen_collector_aicpu.h`, `src/common/platform/shared/aicpu/dep_gen_collector_aicpu.cpp` | Single-instance write path; weak-fallback exported to host build. Both platforms share the same writer implementation — the writer accesses its own device-side view of shared memory, independent of how host↔device transport is implemented. |
+| Host collector | `src/common/platform/include/host/dep_gen_collector.h`, `src/common/platform/shared/host/dep_gen_collector.cpp` | `ProfilerBase<DepGenCollector, DepGenModule>` — drains ring → `records_` vector. On non-SVM platforms it uses the base `alloc_paired_buffer`, which malloc's a host shadow + `copy_to_device`'s it and registers it via `add_malloc_shadow` so teardown can free it; `reconcile_counters` explicitly `copy_from_device`'s the BufferState before reading, and `finalize` lets `BufferPoolManager::clear_mappings()` release all shadows as the single source of truth. |
| Capture call site | `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` `submit_task_common` | One conditional block that snapshots inputs into the ring when `is_dep_gen_enabled()`; fires for both `submit_task` and `submit_dummy_task`. The schema carries `kernel_ids[3] = {aic, aiv0, aiv1}` so the swimlane post-processor can resolve `task_id → kernel` from `deps.json` at level=1 where the AICore record is the sole device-side identity source. Inactive subslots stay at `INVALID_KERNEL_ID = -1`. It also carries the SPMD logical block num (`block_num` on a2a3, `core_num` on a5's launch spec) as `tasks[].block_num`. |
| Replay | `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}` | Pure CPU; runs dual-pass differential replay — `compute_task_fanin` (oracle) + inlined STEP A/B mirror (annotated) against two `PTO2TensorMap` instances. Emits `deps.json` when both passes agree per record. Platform-agnostic — a5 reuses the a2a3 source verbatim. |
| Device-runner hookup | `src/{a2a3,a5}/platform/{onboard,sim}/host/device_runner.cpp` | post-`reconcile_counters` calls `dep_gen_replay_emit_deps_json(records.data(), records.size(), deps_path)` |
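
The `kernel_ids[3] = {aic, aiv0, aiv1}` convention from the "Capture call site" row can be sketched as a small lookup over a `deps.json`-like structure. This is only an illustration under stated assumptions: the field names `task_id`, `kernel_ids`, and `block_num` and the `INVALID_KERNEL_ID = -1` sentinel come from the table, but the surrounding JSON layout, the sample values, and the helper names `active_kernels` and `index_tasks` are hypothetical and not taken from the repository.

```python
INVALID_KERNEL_ID = -1               # sentinel for inactive subslots, per the table
SUBSLOTS = ("aic", "aiv0", "aiv1")   # order of kernel_ids[3], per the table


def active_kernels(kernel_ids):
    """Map subslot name -> kernel id, skipping inactive subslots."""
    if len(kernel_ids) != len(SUBSLOTS):
        raise ValueError(f"expected {len(SUBSLOTS)} kernel ids, got {len(kernel_ids)}")
    return {
        name: kid
        for name, kid in zip(SUBSLOTS, kernel_ids)
        if kid != INVALID_KERNEL_ID
    }


def index_tasks(deps):
    """Build task_id -> active kernels from a deps.json-like dict (layout assumed)."""
    return {task["task_id"]: active_kernels(task["kernel_ids"]) for task in deps["tasks"]}


# Hypothetical sample data; real deps.json contents may differ.
sample = {
    "tasks": [
        {"task_id": 0, "kernel_ids": [7, -1, -1], "block_num": 1},
        {"task_id": 1, "kernel_ids": [-1, 3, 4], "block_num": 8},
    ]
}

lookup = index_tasks(sample)
```

A post-processor built this way would resolve `task_id → kernel` without consulting any other device-side record, which matches the stated purpose of carrying `kernel_ids` in the schema.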
4 changes: 2 additions & 2 deletions docs/dfx/l2-swimlane-profiling.md
@@ -708,7 +708,7 @@ export_swimlane_json() ← writes <output_prefix>/l2_swimlane_record
finalize(unregister, free)
```

-[`L2SwimlaneCollector`](../src/a2a3/platform/include/host/l2_swimlane_collector.h)
+[`L2SwimlaneCollector`](../src/common/platform/include/host/l2_swimlane_collector.h)

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix the collector link targets.

`../src/...` from `docs/dfx/l2-swimlane-profiling.md` resolves under `docs/src/...`, so both links are broken as written. Use `../../src/...` for the shared header in both sections.

Suggested fix
-[`L2SwimlaneCollector`](../src/common/platform/include/host/l2_swimlane_collector.h)
+[`L2SwimlaneCollector`](../../src/common/platform/include/host/l2_swimlane_collector.h)

Apply the same replacement in both the a2a3 and a5 sections.

Also applies to: 841-841
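
The path resolution described in this comment can be checked with a short sketch. It assumes the renderer resolves a relative link lexically against the directory of the Markdown file that contains it; `resolve_link` is a hypothetical helper written for this illustration, not part of the repository.

```python
import posixpath


def resolve_link(doc_path: str, link: str) -> str:
    """Resolve a relative Markdown link against the directory of its containing file."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(doc_path), link))


DOC = "docs/dfx/l2-swimlane-profiling.md"
HEADER = "src/common/platform/include/host/l2_swimlane_collector.h"

# One "../" only climbs from docs/dfx/ to docs/, so the target lands under docs/src/.
broken = resolve_link(DOC, "../" + HEADER)

# Two "../" climb to the repository root, where src/ actually lives.
fixed = resolve_link(DOC, "../../" + HEADER)

print(broken)  # docs/src/common/platform/include/host/l2_swimlane_collector.h
print(fixed)   # src/common/platform/include/host/l2_swimlane_collector.h
```

The same helper shows why the `../src/...` links in `docs/profiling-framework.md` are fine: that file sits one level below the repository root, so a single `../` is enough.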

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/dfx/l2-swimlane-profiling.md` at line 711, The collector links in the
profiling doc are resolving to the wrong location because the relative path is
missing one directory level; update the `L2SwimlaneCollector` link targets in
both the a2a3 and a5 sections to use the shared header path with the correct
`../../src/...` prefix so the markdown points to the actual file.

on a2a3 inherits from
[`profiling_common::ProfilerBase<L2SwimlaneCollector, L2SwimlaneModule>`](../src/common/platform/include/host/profiler_base.h):
the base class owns split mgmt threads, collector shards, and the
@@ -838,7 +838,7 @@ l2_swimlane_collector_.export_swimlane_json()
l2_swimlane_collector_.finalize()
```

-[`L2SwimlaneCollector`](../src/a5/platform/include/host/l2_swimlane_collector.h)
+[`L2SwimlaneCollector`](../src/common/platform/include/host/l2_swimlane_collector.h)
on a5 inherits the same CRTP base
([`profiling_common::ProfilerBase`](../src/common/platform/include/host/profiler_base.h))
as a2a3 and parameterizes
2 changes: 1 addition & 1 deletion docs/hardware/cache-coherency.md
@@ -99,7 +99,7 @@ Two separate concerns, often conflated:
`rmb()` between the COND check and the slot reads.

Concretely, the L2 swimlane staging-slot read in
-`src/{a2a3,a5}/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp` does
+`src/common/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp` does
**not** call `cache_invalidate_range` on the slot, but it **does** call
`rmb()` before reading `slot->task_id` and the timing fields. All of
those fields are AICore writes covered by the AICore-side `dcci` in
8 changes: 4 additions & 4 deletions docs/profiling-framework.md
@@ -177,8 +177,8 @@ the required members are:

The Module structs are defined alongside their collectors in
[pmu_collector.h](../src/a2a3/platform/include/host/pmu_collector.h),
-[l2_swimlane_collector.h](../src/a2a3/platform/include/host/l2_swimlane_collector.h),
-[dep_gen_collector.h](../src/a2a3/platform/include/host/dep_gen_collector.h),
+[l2_swimlane_collector.h](../src/common/platform/include/host/l2_swimlane_collector.h),
+[dep_gen_collector.h](../src/common/platform/include/host/dep_gen_collector.h),
[tensor_dump_collector.h](../src/common/platform/include/host/tensor_dump_collector.h),
and
[scope_stats_collector.h](../src/common/platform/include/host/scope_stats_collector.h)
@@ -336,13 +336,13 @@ Existing collectors are the canonical examples:

- [`PmuCollector`](../src/a2a3/platform/include/host/pmu_collector.h)
— single kind, per-core instances. See [pmu-profiling.md](dfx/pmu-profiling.md).
-- [`DepGenCollector`](../src/a2a3/platform/include/host/dep_gen_collector.h)
+- [`DepGenCollector`](../src/common/platform/include/host/dep_gen_collector.h)
— single kind, one instance. See [dep_gen.md](dfx/dep_gen.md).
- [`TensorDumpCollector`](../src/common/platform/include/host/tensor_dump_collector.h)
— single kind, per-AICPU-thread instances. See [args-dump.md](dfx/args-dump.md).
- [`ScopeStatsCollector`](../src/common/platform/include/host/scope_stats_collector.h)
— single kind, one instance. See [scope-stats.md](dfx/scope-stats.md).
-- [`L2SwimlaneCollector`](../src/a2a3/platform/include/host/l2_swimlane_collector.h)
+- [`L2SwimlaneCollector`](../src/common/platform/include/host/l2_swimlane_collector.h)
— four kinds (AICPU task, scheduler phase, orchestrator phase, AICore
task), per-core / per-thread instances; the canonical multi-kind example. See
[l2-swimlane-profiling.md](dfx/l2-swimlane-profiling.md).
294 changes: 0 additions & 294 deletions src/a2a3/platform/include/host/dep_gen_collector.h

This file was deleted.
