[Question/Feedback] Implementing Prefix Sum - Guidance on Opaque/InCore vs Split Kernels #1153

vloncar · 2026-04-24T13:53:01Z

Hi PyPTO team! 👋

I've been exploring the framework by implementing the prefix-sum algorithm (ScanU, Algorithm 1) from the recent paper Parallelizing Linear Recurrent Neural Networks (arXiv:2505.15112). The algorithm computes prefix sums by combining a matrix multiplication (suited to the CUBE core) with row-by-row vector accumulation (suited to the Vector core).

For context, here is a very simple PyTorch reference implementation that I am trying to port:

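import torch


def reference_scan(x: torch.Tensor, u: torch.Tensor, out: torch.Tensor):
    # Assumption: u is an upper-triangular matrix of ones, so that x @ u
    # yields the prefix sum within each row of x.
    num_rows = x.shape[0]
    # 1. Cube Phase: Matrix multiplication (row-wise prefix sum)
    row_sums = torch.matmul(x, u)
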
    # 2. Vector Phase: Sequential accumulation across rows
    running_sum = torch.zeros((1,), device=x.device, dtype=x.dtype)
    for i in range(num_rows):
        current_row = row_sums[i] + running_sum
        out[i] = current_row
        running_sum = current_row[-1]
    return out
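
To make the two phases concrete, here is a minimal dependency-free sketch of the same idea in plain Python (lists instead of tensors, so this is not PyTorch or PyPTO code). The helper names `upper_triangular_ones`, `matmul`, and `scan_two_phase` are hypothetical, and the sketch assumes `u` is an upper-triangular matrix of ones. It checks that the result equals a flat, row-major cumulative sum.

```python
from itertools import accumulate


def upper_triangular_ones(n):
    """Return an n x n matrix with u[k][j] = 1.0 when k <= j, else 0.0."""
    return [[1.0 if k <= j else 0.0 for j in range(n)] for k in range(n)]


def matmul(a, b):
    """Naive dense matrix product on lists of lists."""
    cols = len(b[0])
    return [
        [sum(a_ik * b[k][j] for k, a_ik in enumerate(row)) for j in range(cols)]
        for row in a
    ]


def scan_two_phase(x):
    """Row-major inclusive prefix sum of x, computed in two phases."""
    u = upper_triangular_ones(len(x[0]))
    row_sums = matmul(x, u)  # phase 1: prefix sum within each row
    out, running_sum = [], 0.0
    for row in row_sums:  # phase 2: carry each row's last element forward
        current = [v + running_sum for v in row]
        out.append(current)
        running_sum = current[-1]
    return out


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
flat_expected = list(accumulate(v for row in x for v in row))
flat_result = [v for row in scan_two_phase(x) for v in row]
assert flat_result == flat_expected
```

Phase 1 is independent per row, while phase 2 depends on the previous row's last element, which is why the second phase is inherently sequential across rows.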

Following the 00-getting_started.md guide and looking through the test examples, I made three different attempts. I eventually got a split-kernel approach working (in simulation, since all PyPTO programs hang on the 910B4 device that I am using), but I wanted to share my feedback on the first two attempts and ask if I missed any obvious idiomatic patterns.

Attempt 1: Tensor-Level (Opaque)

I first tried a naive tensor-level port using @pl.function (which makes the function's type pl.FunctionType.Opaque). I attempted to mix pl.matmul(x, u) with a pl.range loop that updates running_sum and assembles the rows, exactly as in the reference implementation.
Result: This failed to compile, complaining about a lack of kernel code for opaque functions. I found that most runnable examples actually use pl.FunctionType.InCore and pl.FunctionType.Orchestration so I switched to that.

Attempt 2: Tile-Level (InCore)

Next, I tried a hardware-level InCore implementation. Instead of working at tensor level, I had to implement tile-level operations, loading tiles to L0A and L0B, doing the matmul, and then attempting to move the result from L0C to Vec memory to do the row-by-row accumulations.
Result: Compilation failed at the pl.move(c_l0c, target_memory=pl.MemorySpace.Vec) step with the following internal error:

Internal error: GetValidatedTpopSplit called for var not in fs_.tpop_result_vars

It is unclear if I need special set-up for CUBE<->AIV sync like I need in pure PTO-ISA C++ (either via TPUSH/TPOP or TSYNC). The existing examples of cross-core algorithms are simplistic and don't show any additional setup.

Attempt 3: Split Kernels (AIC + AIV) - Success!

Finally, I split the computation explicitly into an AIC (Cube) function and an AIV (Vector) function, chained together in the Orchestrator via a temporary global memory buffer (tmp_out).
Result: This compiled and passed against the torch.cumsum baseline in the a2a3sim simulator! 🎉

# Snippet of the working Orchestrator:
@pl.function(type=pl.FunctionType.Orchestration)
def orchestrator(self, x, u, out):
    tmp_out = pl.create_tensor([num_rows, 64], dtype=pl.FP32)
    tmp_out = self.scan_cube(x, u, tmp_out)  # AIC
    out_ret = self.scan_vector(tmp_out, out) # AIV
    return out_ret

Questions & Guidance Request:

1. Is Attempt 3 (Split Kernels) the recommended/idiomatic way to handle workloads that pass data sequentially from the Cube core to the Vector core?
2. Regarding Attempt 2: Is there a correct way to write this as a single InCore function? Or does moving data directly from L0C to Vec memory for immediate element-wise manipulation require a specific cross-core (TPUSH/TPOP) setup?
3. Return from orchestrator: I noticed that I get different behavior if I write return self.scan_core(...) directly instead of first capturing the output and then returning that object. Is capturing the output first a mandatory pattern in PyPTO?
4. Memory allocation: In Attempt 3, the algorithm itself doesn't strictly need tmp_out, but if I try to reuse out for both steps, the code produces incorrect results. Is an explicit temporary global tensor always required when chaining AIC and AIV functions like this?
5. Loop-carried values: I noticed that pl.range() seems to execute correctly even if I omit init_values for running_sum, although the documentation strongly recommends providing it. Are there edge cases where omitting init_values causes undefined behavior in loop-carried dependencies?
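
As a plain-Python analogy for the loop-carried question (this does not use or describe the PyPTO API; `accumulate_rows` and its `init` parameter are hypothetical names), the sketch below threads the carried value through the loop from an explicit initial value. One generic edge case it highlights is the zero-trip loop: with an explicit initial value the carried result is still well defined when the loop body never runs. Whether PyPTO behaves analogously when init_values is omitted is exactly the open question above.

```python
def accumulate_rows(row_sums, init=0.0):
    """Add a running offset to each row; `carry` is the loop-carried value."""
    carry = init  # explicit initial value, defined before the loop starts
    out = []
    for row in row_sums:
        current = [v + carry for v in row]
        out.append(current)
        carry = current[-1]  # value handed to the next iteration
    return out, carry


# Zero-trip loop: the carried value is simply the initial value.
assert accumulate_rows([]) == ([], 0.0)

# A non-zero starting offset is just a different initial value.
rows = [[1.0, 3.0], [3.0, 7.0]]
assert accumulate_rows(rows, init=10.0) == ([[11.0, 13.0], [16.0, 20.0]], 20.0)
```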

I have attached my full test_scan.py script containing all three implementations if you'd like to take a look. All implementations are based on the latest PyPTO code as of writing this report.

Thank you for building such an interesting framework! Any guidance on whether my direction is correct or if I missed something obvious would be greatly appreciated.

test_scan.py

learning-chip · 2026-04-28T08:57:52Z

From an LLM application point of view, scan (or all-scan) is more useful, and more necessary, in multi-device sequence-parallel linear attention. For a single-device kernel, the chunkwise algorithm is almost always superior to a scan-based prefix sum; see this comparison between scan-based and chunkwise algorithms.

1 reply

vloncar Apr 28, 2026
Author

Thanks for pointing that out. This will be very valuable when we start implementing kernels that aim for maximum performance.

The algorithm in this question is of little importance; this particular scan was chosen because it is simple to understand and implement, which makes it easy to seek guidance on if problems with the framework come up. The goal is to get familiar with the programming practices of PyPTO, not to seek performance gains. The questions raised apply to any algorithm, and I would likely encounter the same issues when implementing the chunkwise algorithm.
