Skip to content

[Feature] AllocateMemoryAddr: liveness-aware offset-packing for differently-shaped disjoint-lifetime L0 buffers #1908

Description

@tonibohnlein

Summary

AllocateMemoryAddr (src/ir/transforms/allocate_memory_addr_pass.cpp) is a pure bump allocator: it gives each MemoryReuse base-Ptr group its own non-overlapping address range (current_addr = AlignAddress(current_addr + slot_size, space) per group) with no awareness of cross-group lifetime disjointness, and never subdivides a freed region. Replace it with liveness-aware offset-packing — place each buffer at the lowest offset free for its entire lifetime (first-fit-by-lifetime / interval coloring) — so a freed region can host differently-shaped, lifetime-disjoint co-live buffers.

Motivation / Use Case

Surfaced by the roofline-chooser work (AutoTileMatmulL0 operand-stationary + Mat-scratch chained matmuls). A chained matmul (a@b)@c kept on-chip in an L1/Mat scratch is fully sequential — the producer computes d into L1, then the consumer reads d from L1 — so the producer's L0A operand buffer and the consumer's L0A double-buffers are used in sequence and should reuse the same space (peak 64 KB).

But when the producer is A-stationary (one monolithic 64 KB held-A buffer) and the consumer is double-buffered (two 32 KB buffers), MemoryReuse reuses the producer's 64 KB buffer for one of the consumer's 32 KB buffers (wasting 32 KB) and allocates a fresh 32 KB for the other → 64 + 32 = 96 KB > 64 KB L0A → overflow. The freed 64 KB region cannot be re-subdivided into the consumer's 2×32 KB.

The current workaround (in the roofline-chooser PR) constrains chained-matmul (Mat-scratch) producers to output-stationary so their buffer shapes match the consumer's double-buffer and pack. That gives up A/B-stationary for chained producers. Offset-packing would retire the workaround and is symmetric — it fixes producer-A-stationary, consumer-A-stationary, and fragmentation alike.

Proposed API / Behavior

In AllocateMemoryAddr, replace the per-group bump with first-fit-by-lifetime: compute each base-Ptr group's lifetime interval (the union of its members' [def, last_use] from MemoryReuse), then place each group at the lowest offset where [off, off+slot_size) is free for that group's whole lifetime. PTOAS's own PlanMemory already does this (it splits a freed region into head | e | tail); pypto's allocator does not.

Additional Context

  • Repro: a chained matmul producer 256x256x128 (A-stationary) feeding a consumer → the full Default pipeline raises Left buffer usage 98304 bytes exceeds platform limit (65536 bytes) at AllocateMemoryAddr.
  • Avoided in tests today: tests/ut/ir/transforms/test_auto_tile_matmul_l0.py::test_chained_matmul_uses_mat_scratch deliberately uses dims (256x256x256) where the chooser picks output-stationary for both matmuls so their L0 buffers pack.
  • Documented as a follow-up in docs/en/dev/passes/14-auto_tile_matmul_l0.md (the Mat-scratch placement section).
  • This is part of the broader pypto/PTOAS memory-planning rework.

Related: #1475 (a different MemoryReuse problem — over-aggressive reuse introducing a WAR hazard that serializes independent matmuls; this issue is the orthogonal offset-assignment/packing gap).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    Status
    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions