From an LLM application point of view, scan (or all-scan) is more useful, and more necessary, in multi-device sequence-parallel linear attention. For a single-device kernel, the chunkwise algorithm is almost always superior to a scan-based prefix sum; see this comparison between scan-based and chunkwise algorithms.
Hi PyPTO team! 👋
I've been exploring the framework by implementing the prefix-sum algorithm (ScanU, Algorithm 1) from the recent paper Parallelizing Linear Recurrent Neural Networks (arXiv:2505.15112). The algorithm calculates prefix sums using a combination of matrix multiplication (suited for the CUBE core) and row-by-row vector accumulations (suited for the Vector core).
For context, here is a very simple PyTorch implementation that I am trying to port:
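A minimal sketch of that reference, reconstructed from the description above rather than copied from the attached script. The function name `scan_u_reference`, the tile width `s`, and the flat 1-D input whose length is a multiple of `s` are all assumptions made for illustration:

```python
import torch


def scan_u_reference(x: torch.Tensor, s: int = 4) -> torch.Tensor:
    """Inclusive prefix sum of a 1-D tensor whose length is a multiple of s.

    Hypothetical helper: the name and the tile width s are illustrative only.
    """
    assert x.dim() == 1 and x.numel() % s == 0
    rows = x.reshape(-1, s)  # view the input as (n / s) rows of width s
    # Upper-triangular matrix of ones: (rows @ u)[i, j] = sum of rows[i, :j + 1].
    u = torch.triu(torch.ones(s, s, dtype=x.dtype))
    local = rows @ u  # matmul step: independent prefix sums per row
    out = torch.empty_like(local)
    running_sum = torch.zeros((), dtype=x.dtype)
    for i in range(local.shape[0]):  # vector step: row-by-row accumulation
        out[i] = local[i] + running_sum  # shift the row by the total so far
        running_sum = out[i, -1]  # last element carries into the next row
    return out.reshape(-1)
```

The matmul produces all per-row prefix sums at once, and the loop only has to carry a single scalar between rows, which is why the two halves map naturally onto different cores. The result can be checked against `torch.cumsum(x, dim=0)`.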
Following the 00-getting_started.md guide and looking through the test examples, I made three different attempts. I eventually got a split-kernel approach working (in simulation, since all PyPTO programs hang on the 910B4 device that I am using), but I wanted to share my feedback on the first two attempts and ask if I missed any obvious idiomatic patterns.

Attempt 1: Tensor-Level (Opaque)

I first tried a naive tensor-level port using @pl.function (making the function of type=pl.FunctionType.Opaque). I attempted to mix pl.matmul(x, u) with a pl.range loop to update the running_sum and assemble the rows, exactly like the reference implementation.

Result: This failed to compile, complaining about a lack of kernel code for opaque functions. I found that most runnable examples actually use pl.FunctionType.InCore and pl.FunctionType.Orchestration, so I switched to that.

Attempt 2: Tile-Level (InCore)

Next, I tried a hardware-level InCore implementation. Instead of working at tensor level, I had to implement tile-level operations, loading tiles to L0A and L0B, doing the matmul, and then attempting to move the result from L0C to Vec memory to do the row-by-row accumulations.

Result: Compilation failed at the pl.move(c_l0c, target_memory=pl.MemorySpace.Vec) step with the following internal error:

It is unclear if I need special set-up for CUBE<->AIV sync like I need in pure PTO-ISA C++ (either via TPUSH/TPOP or TSYNC). The existing examples of cross-core algorithms are simplistic and don't show any additional setup.

Attempt 3: Split Kernels (AIC + AIV) - Success!

Finally, I split the computation explicitly into an AIC (Cube) function and an AIV (Vector) function, chained together in the Orchestrator via a temporary global memory buffer (tmp_out).

Result: This compiled and passed against the torch.cumsum baseline in the a2a3sim simulator! 🎉

Questions & Guidance Request:

1. … InCore function? Or does moving data directly from L0C to Vec memory for immediate element-wise manipulation require a specific cross-core (TPUSH/TPOP) setup?
2. … return self.scan_core(...) instead of capturing the output and returning that object. Is this a mandatory pattern in PyPTO?
3. … tmp_out, but if I try to reuse out for both steps, the code produces incorrect results. Is an explicit temporary global tensor always required when chaining AIC and AIV functions like this?
4. … pl.range() seems to execute correctly even if I omit init_values for running_sum, though the documentation highly recommends it. Are there edge cases where omitting init_values causes undefined behavior in loop-carried dependencies?

I have attached my full test_scan.py script containing all three implementations if you'd like to take a look. All implementations are based on the latest PyPTO code as of writing this report.

Thank you for building such an interesting framework! Any guidance on whether my direction is correct or if I missed something obvious would be greatly appreciated.
test_scan.py