[Feature] AllocateMemoryAddr: liveness-aware offset-packing for differently-shaped disjoint-lifetime L0 buffers

### Summary

`AllocateMemoryAddr` (`src/ir/transforms/allocate_memory_addr_pass.cpp`) is a pure **bump allocator**: it gives each `MemoryReuse` base-Ptr group its own non-overlapping address range (`current_addr = AlignAddress(current_addr + slot_size, space)` per group) with no awareness of cross-group lifetime disjointness, and **never subdivides a freed region**. Replace it with **liveness-aware offset-packing** — place each buffer at the lowest offset free for its *entire lifetime* (first-fit-by-lifetime / interval coloring) — so a freed region can host differently-shaped, lifetime-disjoint co-live buffers.

### Motivation / Use Case

Surfaced by the roofline-chooser work (AutoTileMatmulL0 operand-stationary + Mat-scratch chained matmuls). A chained matmul `(a@b)@c` kept on-chip in an L1/Mat scratch is **fully sequential** — the producer computes `d` into L1, then the consumer reads `d` from L1 — so the producer's L0A operand buffer and the consumer's L0A double-buffers are used in sequence and *should* reuse the same space (peak 64 KB).

But when the producer is **A-stationary** (one monolithic 64 KB held-A buffer) and the consumer is **double-buffered** (two 32 KB buffers), MemoryReuse reuses the producer's 64 KB buffer for *one* of the consumer's 32 KB buffers (wasting 32 KB) and allocates a *fresh* 32 KB for the other → `64 + 32 = 96 KB > 64 KB` L0A → **overflow**. The freed 64 KB region cannot be re-subdivided into the consumer's `2×32 KB`.

The current workaround (in the roofline-chooser PR) constrains chained-matmul (Mat-scratch) producers to **output-stationary** so their buffer shapes match the consumer's double-buffer and pack. That gives up A/B-stationary for chained producers. Offset-packing would **retire the workaround** and is **symmetric** — it fixes producer-A-stationary, consumer-A-stationary, and fragmentation alike.

### Proposed API / Behavior

In `AllocateMemoryAddr`, replace the per-group bump with **first-fit-by-lifetime**: compute each base-Ptr group's lifetime interval (the union of its members' `[def, last_use]` from `MemoryReuse`), then place each group at the lowest offset where `[off, off+slot_size)` is free for that group's whole lifetime. PTOAS's own `PlanMemory` already does this (it splits a freed region into `head | e | tail`); pypto's allocator does not.

### Additional Context

- **Repro:** a chained matmul producer `256x256x128` (A-stationary) feeding a consumer → the full Default pipeline raises `Left buffer usage 98304 bytes exceeds platform limit (65536 bytes)` at `AllocateMemoryAddr`.
- **Avoided in tests today:** `tests/ut/ir/transforms/test_auto_tile_matmul_l0.py::test_chained_matmul_uses_mat_scratch` deliberately uses dims (`256x256x256`) where the chooser picks output-stationary for both matmuls so their L0 buffers pack.
- **Documented as a follow-up** in `docs/en/dev/passes/14-auto_tile_matmul_l0.md` (the Mat-scratch placement section).
- This is part of the broader pypto/PTOAS memory-planning rework.

Related: #1475 (a *different* MemoryReuse problem — over-aggressive reuse introducing a WAR hazard that serializes independent matmuls; this issue is the orthogonal offset-assignment/packing gap).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Feature] AllocateMemoryAddr: liveness-aware offset-packing for differently-shaped disjoint-lifetime L0 buffers #1908

Summary

Motivation / Use Case

Proposed API / Behavior

Additional Context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Feature] AllocateMemoryAddr: liveness-aware offset-packing for differently-shaped disjoint-lifetime L0 buffers #1908

Description

Summary

Motivation / Use Case

Proposed API / Behavior

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions