[WIP] mock: high-fidelity single-stream AllGather FFN pseudo-kernel#187
Code Review
This pull request introduces a high-fidelity pseudo-kernel for a single-PE FFN cell with an AllGather split variant, utilizing a unified L1 address space to overlap communication and computation. The review feedback identifies a Write-After-Read (WAR) hazard in the double-buffering pipeline of hiddenChunkMat where asynchronous TAllGather operations can overwrite buffers before they are read by TMOV. A synchronization mechanism using PIPE_MTE1 -> PIPE_MTE2 flags is suggested to resolve this issue.
// Prime: gather chunk 0 into buffer 0.
comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[0], /*featOffset=*/0, validChunk, validN);
#ifndef __PTO_AUTO__
set_flag(PIPE_MTE2, PIPE_MTE1, EVENT_ID0);
#endif

for (int j = 0; j < FFN_AG_CHUNKS; ++j) {
    const int buf = j & 1;
    const event_t bufEvt = (buf == 0) ? EVENT_ID0 : EVENT_ID1;

    // Launch the next chunk's gather so its TLOAD (MTE2) overlaps this
    // chunk's GEMM (M) below.
    if (j + 1 < FFN_AG_CHUNKS) {
        const int nb = (j + 1) & 1;
        const event_t nbEvt = (nb == 0) ? EVENT_ID0 : EVENT_ID1;
        comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[nb], (j + 1) * validChunk, validChunk,
                                                validN);
#ifndef __PTO_AUTO__
        set_flag(PIPE_MTE2, PIPE_MTE1, nbEvt);
#endif
    }

#ifndef __PTO_AUTO__
    wait_flag(PIPE_MTE2, PIPE_MTE1, bufEvt);
#endif
    TMOV(aDownT, hiddenChunkMat[buf]);             // hidden K-band L1 -> L0A
    TEXTRACT(bDownT, wDownMat, j * validChunk, 0); // W_down K-band L1 -> L0B

#ifndef __PTO_AUTO__
    set_flag(PIPE_MTE1, PIPE_M, bufEvt);
    wait_flag(PIPE_MTE1, PIPE_M, bufEvt);
#endif
The current double-buffering implementation for hiddenChunkMat has a Write-After-Read (WAR) hazard. While PIPE_MTE2 -> PIPE_MTE1 synchronization is present (ensuring that TAllGather completes before TMOV reads the buffer), there is no reverse PIPE_MTE1 -> PIPE_MTE2 synchronization.
Without this reverse synchronization, the asynchronous TAllGather (MTE2) issued in iteration j + 1 targets the same buffer that iteration j reads (both use index j & 1), so it can overwrite that buffer before the TMOV (MTE1) in iteration j has finished reading it.
To resolve this:
- Pre-set EVENT_ID1 on PIPE_MTE1 -> PIPE_MTE2 before the loop (since buffer 1 is initially empty and safe to write to).
- Before launching TAllGather for the next chunk, wait for the corresponding event on PIPE_MTE1 -> PIPE_MTE2.
- After TMOV finishes reading hiddenChunkMat[buf], set the corresponding event on PIPE_MTE1 -> PIPE_MTE2 to signal that the buffer is safe to reuse.
// Prime: gather chunk 0 into buffer 0.
#ifndef __PTO_AUTO__
set_flag(PIPE_MTE1, PIPE_MTE2, EVENT_ID1);
#endif
comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[0], /*featOffset=*/0, validChunk, validN);
#ifndef __PTO_AUTO__
set_flag(PIPE_MTE2, PIPE_MTE1, EVENT_ID0);
#endif
for (int j = 0; j < FFN_AG_CHUNKS; ++j) {
    const int buf = j & 1;
    const event_t bufEvt = (buf == 0) ? EVENT_ID0 : EVENT_ID1;
    // Launch the next chunk's gather so its TLOAD (MTE2) overlaps this
    // chunk's GEMM (M) below.
    if (j + 1 < FFN_AG_CHUNKS) {
        const int nb = (j + 1) & 1;
        const event_t nbEvt = (nb == 0) ? EVENT_ID0 : EVENT_ID1;
#ifndef __PTO_AUTO__
        wait_flag(PIPE_MTE1, PIPE_MTE2, nbEvt);
#endif
        comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[nb], (j + 1) * validChunk, validChunk,
                                                validN);
#ifndef __PTO_AUTO__
        set_flag(PIPE_MTE2, PIPE_MTE1, nbEvt);
#endif
    }
#ifndef __PTO_AUTO__
    wait_flag(PIPE_MTE2, PIPE_MTE1, bufEvt);
#endif
    TMOV(aDownT, hiddenChunkMat[buf]);             // hidden K-band L1 -> L0A
    TEXTRACT(bDownT, wDownMat, j * validChunk, 0); // W_down K-band L1 -> L0B
#ifndef __PTO_AUTO__
    set_flag(PIPE_MTE1, PIPE_MTE2, bufEvt);
#endif
#ifndef __PTO_AUTO__
    set_flag(PIPE_MTE1, PIPE_M, bufEvt);
    wait_flag(PIPE_MTE1, PIPE_M, bufEvt);
#endif
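The two-direction handshake described in this review can be sanity-checked on the host with a toy model. The sketch below is not PTO code and does not call set_flag/wait_flag: it is a plain C++ simulation in which a producer (standing in for the MTE2 gather) runs as far ahead as it is allowed to, and a consumer (standing in for the MTE1 TMOV) drains two ping-pong buffers. The names simulate, PipelineTrace and freeFlag are hypothetical, and for simplicity both reverse flags are pre-set rather than only the one for buffer 1.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Outcome of one simulated run of the two-buffer pipeline.
struct PipelineTrace {
    bool hazard;               // true if a buffer was overwritten before being read
    std::vector<int> consumed; // chunk ids in the order the consumer saw them
};

// Toy host model (assumption: greedy producer, i.e. the worst case for WAR).
// reverseSync == false models forward-only flags (producer -> consumer).
// reverseSync == true adds the consumer -> producer "buffer is free" flags.
PipelineTrace simulate(int chunks, bool reverseSync) {
    std::array<int, 2> held{-1, -1};          // chunk id stored in each buffer
    std::array<bool, 2> unread{false, false}; // buffer holds data not yet consumed
    std::array<int, 2> freeFlag{1, 1};        // reverse flags, pre-set: both buffers start empty
    PipelineTrace trace{false, {}};
    int produced = 0;
    int consumed = 0;

    while (consumed < chunks) {
        const int pb = produced & 1;
        const bool canProduce =
            produced < chunks && (!reverseSync || freeFlag[pb] > 0);
        if (canProduce) {
            if (reverseSync) --freeFlag[pb];        // models wait on the reverse flag
            if (unread[pb]) trace.hazard = true;    // write-after-read violation
            held[pb] = produced;
            unread[pb] = true;
            ++produced;
            continue;
        }
        // Forward flag: the consumer only runs on data that was produced.
        assert(produced > consumed);
        const int cb = consumed & 1;
        trace.consumed.push_back(held[cb]);
        unread[cb] = false;
        if (reverseSync) ++freeFlag[cb];            // models set on the reverse flag
        ++consumed;
    }
    return trace;
}
```

With four chunks, simulate(4, false) reports a hazard because chunk 2 lands in buffer 0 before chunk 0 is read, while simulate(4, true) consumes the chunks in order. The model says nothing about real pipe timing; it only shows why the forward flag alone cannot protect a reused buffer.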
Add a hardware-instruction-stream simulation pseudo-kernel under kernels/manual/a2a3/distributed_ffn_grid.

The intent is to first provide a pseudo-code example that faithfully mirrors the real hardware instruction stream (data movement / compute / collective), using real PTO ISA names, signatures, tile types and data flow throughout.

It models a unified single-core processing element (the matmul unit and vector unit share one L1): the mixed Cube/Vec A2/A3 kernel is rewritten into one straight-line instruction stream where the cube L1 and vector UB are fused into a single L1 address space, with no __DAV_CUBE__/__DAV_VEC__ split.

Two memory domains are kept distinct:
- Batcher mem: host I/O only. It streams input X into each cell's L1 (TLOAD) and collects the final [T, Hc] result back out (TSTORE); weights load DRAM -> L1 via TLOAD.
- NoC-interconnected L1: the row-local hidden AllGather moves L1<->L1 across cells over the on-chip NoC and never round-trips Batcher mem. Each cell publishes its [T, Fi] shard into a NoC-visible L1 SRAM window (TSTORE) and gathers peers' shards over the NoC.

The Acc->Vec accumulator drain uses the ISA's only legal path, the C2V fixpipe, spelled as the directional TMOV(pipe, acc)/TMOV(vec, pipe).

Adds an in-file comm::TAllGather that concatenates along the feature axis (rank r at feature offset r*shardCols) and gathers one fine-grained [T, chunkCols] band per call. The single [T, F] gather is split into FFN_AG_CHUNKS=4 bands; each gathered band is consumed by an accumulating down GEMM over its K-band, with the next band's gather (MTE2) issued before the current band's GEMM (M) so the collective hides under compute.

SCOPE: high-fidelity pseudo-code, not a buildable dav-c220 object. The unified stream deliberately keeps TMATMUL and vlrelu/vmul/vconv side by side, so it does not compile for dav-c220 (the sub-cores have disjoint instruction sets). That is by design: the goal is a faithful real hardware instruction stream, not an A3 binary.
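The feature-axis layout described for comm::TAllGather (rank r at feature offset r*shardCols, one [T, chunkCols] band per call) can be illustrated with a host-side index calculation. This is a minimal sketch under stated assumptions, not the in-file collective: gatherBand and Shard are hypothetical names, shards are plain row-major float vectors, and no NoC, tile types or ND2NZ layout are modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Shard = std::vector<float>; // one rank's [rows, shardCols] block, row-major

// Gather the band of global feature columns [featOffset, featOffset + chunkCols)
// from per-rank shards that are concatenated along the feature axis.
// Global column f lives on rank f / shardCols at local column f % shardCols.
std::vector<float> gatherBand(const std::vector<Shard>& shards, int rows,
                              int shardCols, int featOffset, int chunkCols) {
    const int totalCols = static_cast<int>(shards.size()) * shardCols;
    assert(featOffset >= 0 && chunkCols >= 0);
    assert(featOffset + chunkCols <= totalCols);

    std::vector<float> band(static_cast<std::size_t>(rows) * chunkCols);
    for (int t = 0; t < rows; ++t) {
        for (int c = 0; c < chunkCols; ++c) {
            const int f = featOffset + c;     // global feature column
            const int owner = f / shardCols;  // rank that holds column f
            const int local = f % shardCols;  // column inside that rank's shard
            band[static_cast<std::size_t>(t) * chunkCols + c] =
                shards[owner][static_cast<std::size_t>(t) * shardCols + local];
        }
    }
    return band;
}
```

Calling gatherBand with featOffset = 0 and chunkCols equal to the full feature width reproduces the whole concatenated matrix; calling it once per band with featOffset = j * chunkCols mirrors the chunked gather the message describes.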
What
Adds a hardware-instruction-stream simulation pseudo-kernel under kernels/manual/a2a3/distributed_ffn_grid/mock-distributed_ffn_grid_allgather_compute_kernel.cpp.

The core goal of this WIP is to first provide a pseudo-code example that faithfully mirrors the real hardware instruction stream (data movement, compute, and collective communication), using real PTO ISA instruction names, signatures, tile types, and data flow throughout.

Design summary
It rewrites the original mixed Cube/Vec A2/A3 kernel into unified single-L1, single-instruction-stream, high-fidelity pseudo-code, modeling one unified single-core processing element (the matmul unit and vector unit share one L1):
- No __DAV_CUBE__/__DAV_VEC__ split: one straight-line stream; the cube L1 and vector UB are fused into a single L1 address space (only kL1*, no kUb*).
- Batcher mem (host I/O only): streams input X into each cell's L1 (TLOAD) and collects the final [T, Hc] result back out (TSTORE). Weights load DRAM → L1 via TLOAD.
- NoC-interconnected L1: the row-local hidden AllGather moves L1<->L1 across cells over the on-chip NoC and never round-trips Batcher mem. Each cell publishes its [T, Fi] hidden shard into a NoC-visible L1 SRAM window (TSTORE) and gathers peers' shards over the NoC.
- Acc->Vec accumulator drain: uses the ISA's only legal path, the C2V fixpipe, spelled as the directional TMOV(pipe, acc)/TMOV(vec, pipe).

comm::TAllGather + fine-grained overlap
Adds an in-file comm::TAllGather (signature aligned with the library collectives TGATHER/TREDUCE) that:
- concatenates along the feature axis (rank r at feature offset r*shardCols), folding in the re-layout that TGATHER's row-stacking would otherwise force; and
- gathers one fine-grained [T, chunkCols] band per call, TLOAD-ed over the NoC straight from the owning peer cell's L1 SRAM window into a matmul L1 tile (static shape, satisfying ND2NZ).

The single [T, F] AllGather is split into FFN_AG_CHUNKS=4 bands. Each gathered band is consumed by an accumulating down GEMM (TMATMUL → TMATMUL_ACC) over its K-band, with the next band's gather (MTE2) issued before the current band's GEMM (M), so the collective hides under compute (double-buffered EVENT_ID0/EVENT_ID1).

Instruction-stream overview
TLOAD, TLOAD, TMOV, TMATMUL, TMOV(pipe,acc)/TMOV(vec,pipe), TLRELU/TMUL/TCVT, TSTORE, TNOTIFY/TWAIT, comm::TAllGather, TMOV/TEXTRACT + TMATMUL_ACC, TMOV, TSTORE

Scope / notes
- The unified stream deliberately keeps TMATMUL and vlrelu/vmul/vconv side by side, so it does not compile for dav-c220 (Ascend 910B) because the cube (AIC) and vector (AIV) sub-cores have disjoint instruction sets and are each compiled once. That is by design: the goal is a faithful real-hardware instruction stream, not an A3 binary.
- (void)) since intermediates now reside in L1.
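The band split relies on the fact that a GEMM over the full K dimension equals the sum of GEMMs over disjoint K-bands, which is what the TMATMUL followed by TMATMUL_ACC sequence expresses. The sketch below checks that identity on the host with integer matrices. It is an illustration under stated assumptions, not kernel code: accumulateBand, downGemmChunked and Mat are hypothetical names, matrices are row-major std::vector<int>, and F is assumed to divide evenly into the chosen number of chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<int>; // row-major matrix storage

// c[T,H] += a[T, k0 : k0+kLen] * b[k0 : k0+kLen, H]
// One call corresponds to consuming a single gathered K-band.
void accumulateBand(const Mat& a, const Mat& b, Mat& c,
                    int T, int F, int H, int k0, int kLen) {
    for (int t = 0; t < T; ++t) {
        for (int h = 0; h < H; ++h) {
            int sum = 0;
            for (int k = k0; k < k0 + kLen; ++k) {
                sum += a[static_cast<std::size_t>(t) * F + k] *
                       b[static_cast<std::size_t>(k) * H + h];
            }
            c[static_cast<std::size_t>(t) * H + h] += sum;
        }
    }
}

// Down GEMM computed band by band; chunks == 1 is the unsplit reference.
Mat downGemmChunked(const Mat& a, const Mat& b, int T, int F, int H, int chunks) {
    assert(chunks > 0 && F % chunks == 0);
    Mat c(static_cast<std::size_t>(T) * H, 0); // accumulator starts at zero
    const int kLen = F / chunks;
    for (int j = 0; j < chunks; ++j) {
        accumulateBand(a, b, c, T, F, H, j * kLen, kLen);
    }
    return c;
}
```

Because each band only needs its own slice of the hidden activations and of W_down, band j can be multiplied while band j + 1 is still being gathered, which is the overlap the description aims for.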