Summary
AllocateMemoryAddr (src/ir/transforms/allocate_memory_addr_pass.cpp) is a pure bump allocator: it gives each MemoryReuse base-Ptr group its own non-overlapping address range (current_addr = AlignAddress(current_addr + slot_size, space) per group) with no awareness of cross-group lifetime disjointness, and never subdivides a freed region. Replace it with liveness-aware offset-packing — place each buffer at the lowest offset free for its entire lifetime (first-fit-by-lifetime / interval coloring) — so a freed region can host differently-shaped, lifetime-disjoint co-live buffers.
Motivation / Use Case
Surfaced by the roofline-chooser work (AutoTileMatmulL0 operand-stationary + Mat-scratch chained matmuls). A chained matmul (a@b)@c kept on-chip in an L1/Mat scratch is fully sequential — the producer computes d into L1, then the consumer reads d from L1 — so the producer's L0A operand buffer and the consumer's L0A double-buffers are used in sequence and should reuse the same space (peak 64 KB).
But when the producer is A-stationary (one monolithic 64 KB held-A buffer) and the consumer is double-buffered (two 32 KB buffers), MemoryReuse reuses the producer's 64 KB buffer for one of the consumer's 32 KB buffers (wasting 32 KB) and allocates a fresh 32 KB for the other → 64 + 32 = 96 KB > 64 KB L0A → overflow. The freed 64 KB region cannot be re-subdivided into the consumer's 2×32 KB.
The current workaround (in the roofline-chooser PR) constrains chained-matmul (Mat-scratch) producers to output-stationary so their buffer shapes match the consumer's double-buffer and pack. That gives up A/B-stationary for chained producers. Offset-packing would retire the workaround and is symmetric — it fixes producer-A-stationary, consumer-A-stationary, and fragmentation alike.
Proposed API / Behavior
In AllocateMemoryAddr, replace the per-group bump with first-fit-by-lifetime: compute each base-Ptr group's lifetime interval (the union of its members' [def, last_use] from MemoryReuse), then place each group at the lowest offset where [off, off+slot_size) is free for that group's whole lifetime. PTOAS's own PlanMemory already does this (it splits a freed region into head | e | tail); pypto's allocator does not.
Additional Context
- Repro: a chained matmul producer
256x256x128 (A-stationary) feeding a consumer → the full Default pipeline raises Left buffer usage 98304 bytes exceeds platform limit (65536 bytes) at AllocateMemoryAddr.
- Avoided in tests today:
tests/ut/ir/transforms/test_auto_tile_matmul_l0.py::test_chained_matmul_uses_mat_scratch deliberately uses dims (256x256x256) where the chooser picks output-stationary for both matmuls so their L0 buffers pack.
- Documented as a follow-up in
docs/en/dev/passes/14-auto_tile_matmul_l0.md (the Mat-scratch placement section).
- This is part of the broader pypto/PTOAS memory-planning rework.
Related: #1475 (a different MemoryReuse problem — over-aggressive reuse introducing a WAR hazard that serializes independent matmuls; this issue is the orthogonal offset-assignment/packing gap).
Summary
AllocateMemoryAddr(src/ir/transforms/allocate_memory_addr_pass.cpp) is a pure bump allocator: it gives eachMemoryReusebase-Ptr group its own non-overlapping address range (current_addr = AlignAddress(current_addr + slot_size, space)per group) with no awareness of cross-group lifetime disjointness, and never subdivides a freed region. Replace it with liveness-aware offset-packing — place each buffer at the lowest offset free for its entire lifetime (first-fit-by-lifetime / interval coloring) — so a freed region can host differently-shaped, lifetime-disjoint co-live buffers.Motivation / Use Case
Surfaced by the roofline-chooser work (AutoTileMatmulL0 operand-stationary + Mat-scratch chained matmuls). A chained matmul
(a@b)@ckept on-chip in an L1/Mat scratch is fully sequential — the producer computesdinto L1, then the consumer readsdfrom L1 — so the producer's L0A operand buffer and the consumer's L0A double-buffers are used in sequence and should reuse the same space (peak 64 KB).But when the producer is A-stationary (one monolithic 64 KB held-A buffer) and the consumer is double-buffered (two 32 KB buffers), MemoryReuse reuses the producer's 64 KB buffer for one of the consumer's 32 KB buffers (wasting 32 KB) and allocates a fresh 32 KB for the other →
64 + 32 = 96 KB > 64 KBL0A → overflow. The freed 64 KB region cannot be re-subdivided into the consumer's2×32 KB.The current workaround (in the roofline-chooser PR) constrains chained-matmul (Mat-scratch) producers to output-stationary so their buffer shapes match the consumer's double-buffer and pack. That gives up A/B-stationary for chained producers. Offset-packing would retire the workaround and is symmetric — it fixes producer-A-stationary, consumer-A-stationary, and fragmentation alike.
Proposed API / Behavior
In
AllocateMemoryAddr, replace the per-group bump with first-fit-by-lifetime: compute each base-Ptr group's lifetime interval (the union of its members'[def, last_use]fromMemoryReuse), then place each group at the lowest offset where[off, off+slot_size)is free for that group's whole lifetime. PTOAS's ownPlanMemoryalready does this (it splits a freed region intohead | e | tail); pypto's allocator does not.Additional Context
256x256x128(A-stationary) feeding a consumer → the full Default pipeline raisesLeft buffer usage 98304 bytes exceeds platform limit (65536 bytes)atAllocateMemoryAddr.tests/ut/ir/transforms/test_auto_tile_matmul_l0.py::test_chained_matmul_uses_mat_scratchdeliberately uses dims (256x256x256) where the chooser picks output-stationary for both matmuls so their L0 buffers pack.docs/en/dev/passes/14-auto_tile_matmul_l0.md(the Mat-scratch placement section).Related: #1475 (a different MemoryReuse problem — over-aggressive reuse introducing a WAR hazard that serializes independent matmuls; this issue is the orthogonal offset-assignment/packing gap).