[WIP] mock: high-fidelity single-stream AllGather FFN pseudo-kernel by chenshengxin2026 · Pull Request #187 · hw-native-sys/pto-isa

chenshengxin2026 · 2026-07-01T08:24:16Z

What

Adds a hardware-instruction-stream simulation pseudo-kernel under
kernels/manual/a2a3/distributed_ffn_grid/mock-distributed_ffn_grid_allgather_compute_kernel.cpp.

The core goal of this WIP is to first provide a pseudo-code example that faithfully mirrors the real hardware instruction stream — data movement, compute, and collective communication — using real PTO ISA instruction names, signatures, tile types, and data flow throughout.

Design summary

It rewrites the original mixed Cube/Vec A2/A3 kernel as high-fidelity pseudo-code with a single L1 and a single instruction stream, modeling one unified single-core processing element in which the matmul unit and vector unit share one L1:

- No __DAV_CUBE__/__DAV_VEC__ split: one straight-line stream; the cube L1 and vector UB are fused into a single L1 address space (only kL1*, no kUb*).
- Two memory domains kept distinct:
  - Batcher mem: host I/O only. It streams input X into each cell's L1 (TLOAD) and collects the final [T, Hc] result back out (TSTORE). Weights load DRAM → L1 via TLOAD.
  - NoC-interconnected L1: the row-local hidden AllGather moves L1↔L1 across cells over the on-chip NoC and never round-trips Batcher mem. Each cell publishes its [T, Fi] hidden shard into a NoC-visible L1 SRAM window (TSTORE) and gathers peers' shards over the NoC.
- Acc→Vec drain uses the ISA's only legal path, the C2V fixpipe, spelled as the directional TMOV(pipe, acc) / TMOV(vec, pipe).

`comm::TAllGather` + fine-grained overlap

Adds an in-file comm::TAllGather (signature aligned with the library collectives TGATHER/TREDUCE) that:

- concatenates along the feature axis (rank r at feature offset r*shardCols), folding in the re-layout that TGATHER's row-stacking would otherwise force; and
- gathers one fine-grained [T, chunkCols] band per call, TLOAD-ed over the NoC straight from the owning peer cell's L1 SRAM window into a matmul L1 tile (static shape, satisfying ND2NZ).
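
The feature-axis concatenation described above can be sketched on the host as a plain C++ reference model. This is only an illustration under stated assumptions: `Matrix`, `makeShards`, and `gatherBand` are hypothetical names that do not appear in the PR, the shards live in ordinary host memory rather than a NoC-visible L1 window, and the band is allowed to straddle two ranks even though the PR describes each band as coming from a single owning peer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // row-major [rows][cols]

// Hypothetical test data: rank r holds a [T, shardCols] shard whose entries
// encode (rank, row, column) so gathered values are easy to recognise.
std::vector<Matrix> makeShards(std::size_t ranks, std::size_t T, std::size_t shardCols) {
    std::vector<Matrix> shards(ranks, Matrix(T, std::vector<float>(shardCols)));
    for (std::size_t r = 0; r < ranks; ++r)
        for (std::size_t t = 0; t < T; ++t)
            for (std::size_t c = 0; c < shardCols; ++c)
                shards[r][t][c] = static_cast<float>(100 * r + 10 * t + c);
    return shards;
}

// Returns columns [featOffset, featOffset + chunkCols) of the logical [T, F]
// matrix in which rank r's shard sits at feature offset r * shardCols.
// The full [T, F] matrix is never materialised.
Matrix gatherBand(const std::vector<Matrix>& shards, std::size_t featOffset, std::size_t chunkCols) {
    const std::size_t T = shards.at(0).size();
    const std::size_t shardCols = shards.at(0).at(0).size();
    assert(featOffset + chunkCols <= shards.size() * shardCols);
    Matrix band(T, std::vector<float>(chunkCols));
    for (std::size_t t = 0; t < T; ++t) {
        for (std::size_t c = 0; c < chunkCols; ++c) {
            const std::size_t feat = featOffset + c;     // column in the logical [T, F] matrix
            const std::size_t rank = feat / shardCols;   // peer that owns this column
            const std::size_t local = feat % shardCols;  // column inside that peer's shard
            band[t][c] = shards[rank][t][local];
        }
    }
    return band;
}
```

With two ranks and `shardCols = 2`, `gatherBand(shards, 2, 2)` returns rank 1's whole shard, which is the case the PR's per-peer bands correspond to.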

The single [T, F] AllGather is split into FFN_AG_CHUNKS=4 bands. Each gathered band is consumed by an accumulating down GEMM (TMATMUL → TMATMUL_ACC) over its K-band, with the next band's gather (MTE2) issued before the current band's GEMM (M), so the collective hides under compute (double-buffered EVENT_ID0/EVENT_ID1).
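
The reason the gather can be split into bands is that the down GEMM is a sum over K, so each K-band contributes an independent partial product. The following host-side sketch checks that identity; it is not the kernel. `fullGemm`, `bandedGemm`, and `ramp` are hypothetical helpers, and the `j == 0` branch only stands in for the TMATMUL versus TMATMUL_ACC distinction.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // row-major [rows][cols]

// Hypothetical data helper: fills a matrix with start, start + 1, ... in row-major order.
Matrix ramp(std::size_t rows, std::size_t cols, float start) {
    Matrix m(rows, std::vector<float>(cols));
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            m[r][c] = start + static_cast<float>(r * cols + c);
    return m;
}

// Reference: out[T, N] = hidden[T, F] * wDown[F, N] in a single pass over K.
Matrix fullGemm(const Matrix& hidden, const Matrix& wDown) {
    const std::size_t T = hidden.size(), F = wDown.size(), N = wDown.at(0).size();
    Matrix out(T, std::vector<float>(N, 0.0f));
    for (std::size_t t = 0; t < T; ++t)
        for (std::size_t n = 0; n < N; ++n)
            for (std::size_t k = 0; k < F; ++k)
                out[t][n] += hidden[t][k] * wDown[k][n];
    return out;
}

// Banded: band 0 overwrites the accumulator (the TMATMUL role) and every
// later band adds its partial product (the TMATMUL_ACC role).
Matrix bandedGemm(const Matrix& hidden, const Matrix& wDown, std::size_t chunks) {
    const std::size_t T = hidden.size(), F = wDown.size(), N = wDown.at(0).size();
    assert(chunks > 0 && F % chunks == 0);
    const std::size_t chunkCols = F / chunks;
    Matrix acc(T, std::vector<float>(N, 0.0f));
    for (std::size_t j = 0; j < chunks; ++j) {
        const std::size_t k0 = j * chunkCols;  // first K index of band j
        for (std::size_t t = 0; t < T; ++t) {
            for (std::size_t n = 0; n < N; ++n) {
                float partial = 0.0f;
                for (std::size_t k = k0; k < k0 + chunkCols; ++k)
                    partial += hidden[t][k] * wDown[k][n];
                acc[t][n] = (j == 0) ? partial : acc[t][n] + partial;
            }
        }
    }
    return acc;
}
```

The inputs used for checking are small integers, so the two summation orders agree exactly; with general floating-point data the banded result can differ from the single-pass result by rounding.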

Instruction-stream overview

| Stage | Instruction | Direction |
| --- | --- | --- |
| Load input X | `TLOAD` | Batcher mem → L1 |
| Load weights | `TLOAD` | DRAM → L1 |
| gate/up feed | `TMOV` | L1 → L0A/L0B |
| gate/up matmul | `TMATMUL` | - |
| Acc → vector | `TMOV(pipe,acc)` / `TMOV(vec,pipe)` | L0C → L1 (fixpipe) |
| PReLU·mul·cvt | `TLRELU`/`TMUL`/`TCVT` | on L1 |
| Publish hidden shard | `TSTORE` | L1 → NoC-visible L1 SRAM window |
| Arrival barrier | `TNOTIFY`/`TWAIT` | on-chip NoC barrier |
| Gather peer shards (×4) | `comm::TAllGather` | peer L1 SRAM (NoC) → L1 |
| down GEMM (×4 accumulate) | `TMOV`/`TEXTRACT` + `TMATMUL_ACC` | - |
| Drain down accumulator | `TMOV` | L0C → L1 |
| Write output shard | `TSTORE` | L1 → Batcher mem |

Scope / notes

- This is high-fidelity pseudo-code, not a buildable dav-c220 object. The unified stream deliberately keeps TMATMUL and vlrelu/vmul/vconv side by side, so it does not compile for dav-c220 (Ascend 910B): the cube (AIC) and vector (AIV) sub-cores have disjoint instruction sets, and each is compiled separately. That is by design: the goal is a faithful real-hardware instruction stream, not an A3 binary.
- The kernel signature and launch are unchanged (drop-in, 15 params); some GM staging params are intentionally left unused (cast to (void)) because intermediates now reside in L1.
- WIP: this is an initial pseudo-code reference for reviewing the instruction-stream modeling; it is not yet wired into build/test.

gemini-code-assist

Code Review

This pull request introduces a high-fidelity pseudo-kernel for a single-PE FFN cell with an AllGather split variant, utilizing a unified L1 address space to overlap communication and computation. The review feedback identifies a Write-After-Read (WAR) hazard in the double-buffering pipeline of hiddenChunkMat where asynchronous TAllGather operations can overwrite buffers before they are read by TMOV. A synchronization mechanism using PIPE_MTE1 -> PIPE_MTE2 flags is suggested to resolve this issue.

gemini-code-assist · 2026-07-01T08:26:24Z

+    // Prime: gather chunk 0 into buffer 0.
+    comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[0], /*featOffset=*/0, validChunk, validN);
+#ifndef __PTO_AUTO__
+    set_flag(PIPE_MTE2, PIPE_MTE1, EVENT_ID0);
+#endif
+
+    for (int j = 0; j < FFN_AG_CHUNKS; ++j) {
+        const int buf = j & 1;
+        const event_t bufEvt = (buf == 0) ? EVENT_ID0 : EVENT_ID1;
+
+        // Launch the next chunk's gather so its TLOAD (MTE2) overlaps this
+        // chunk's GEMM (M) below.
+        if (j + 1 < FFN_AG_CHUNKS) {
+            const int nb = (j + 1) & 1;
+            const event_t nbEvt = (nb == 0) ? EVENT_ID0 : EVENT_ID1;
+            comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[nb], (j + 1) * validChunk, validChunk,
+                                                    validN);
+#ifndef __PTO_AUTO__
+            set_flag(PIPE_MTE2, PIPE_MTE1, nbEvt);
+#endif
+        }
+
+#ifndef __PTO_AUTO__
+        wait_flag(PIPE_MTE2, PIPE_MTE1, bufEvt);
+#endif
+        TMOV(aDownT, hiddenChunkMat[buf]);             // hidden K-band  L1 -> L0A
+        TEXTRACT(bDownT, wDownMat, j * validChunk, 0); // W_down K-band  L1 -> L0B
+
+#ifndef __PTO_AUTO__
+        set_flag(PIPE_MTE1, PIPE_M, bufEvt);
+        wait_flag(PIPE_MTE1, PIPE_M, bufEvt);
+#endif


The current double-buffering implementation for hiddenChunkMat has a Write-After-Read (WAR) hazard. While PIPE_MTE2 -> PIPE_MTE1 synchronization is present (ensuring that TAllGather completes before TMOV reads the buffer), there is no reverse PIPE_MTE1 -> PIPE_MTE2 synchronization.

Without this reverse synchronization, the asynchronous TAllGather (MTE2) issued in iteration j + 1 targets hiddenChunkMat[(j + 2) & 1], which is the same buffer that the TMOV (MTE1) in iteration j reads, so it can overwrite that buffer before the read has finished.

To resolve this:

1. Pre-set EVENT_ID1 on PIPE_MTE1 -> PIPE_MTE2 before the loop (since buffer 1 is initially empty and safe to write to).
2. Before launching TAllGather for the next chunk, wait for the corresponding event on PIPE_MTE1 -> PIPE_MTE2.
3. After TMOV finishes reading hiddenChunkMat[buf], set the corresponding event on PIPE_MTE1 -> PIPE_MTE2 to signal that the buffer is safe to reuse.

    // Prime: gather chunk 0 into buffer 0.
    #ifndef __PTO_AUTO__
        set_flag(PIPE_MTE1, PIPE_MTE2, EVENT_ID1);
    #endif
        comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[0], /*featOffset=*/0, validChunk, validN);
    #ifndef __PTO_AUTO__
        set_flag(PIPE_MTE2, PIPE_MTE1, EVENT_ID0);
    #endif

        for (int j = 0; j < FFN_AG_CHUNKS; ++j) {
            const int buf = j & 1;
            const event_t bufEvt = (buf == 0) ? EVENT_ID0 : EVENT_ID1;

            // Launch the next chunk's gather so its TLOAD (MTE2) overlaps this
            // chunk's GEMM (M) below.
            if (j + 1 < FFN_AG_CHUNKS) {
                const int nb = (j + 1) & 1;
                const event_t nbEvt = (nb == 0) ? EVENT_ID0 : EVENT_ID1;
    #ifndef __PTO_AUTO__
                wait_flag(PIPE_MTE1, PIPE_MTE2, nbEvt);
    #endif
                comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[nb], (j + 1) * validChunk, validChunk,
                                                        validN);
    #ifndef __PTO_AUTO__
                set_flag(PIPE_MTE2, PIPE_MTE1, nbEvt);
    #endif
            }

    #ifndef __PTO_AUTO__
            wait_flag(PIPE_MTE2, PIPE_MTE1, bufEvt);
    #endif
            TMOV(aDownT, hiddenChunkMat[buf]);             // hidden K-band  L1 -> L0A
            TEXTRACT(bDownT, wDownMat, j * validChunk, 0); // W_down K-band  L1 -> L0B
    #ifndef __PTO_AUTO__
            set_flag(PIPE_MTE1, PIPE_MTE2, bufEvt);
    #endif

    #ifndef __PTO_AUTO__
            set_flag(PIPE_MTE1, PIPE_M, bufEvt);
            wait_flag(PIPE_MTE1, PIPE_M, bufEvt);
    #endif
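
To make the hazard concrete, here is a small host-side toy model of the two flag protocols. It is a sketch under explicit assumptions, not the PTO runtime: MTE2 and MTE1 are modeled as two in-order queues, each event flag is a counter, `set_flag(src, dst, e)` is assumed to execute on the source pipe and `wait_flag(src, dst, e)` to block the destination pipe, and the scheduler deliberately lets the gather pipe run as far ahead as the flags permit. `Op`, `Instr`, `SimResult`, and `simulate` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <deque>
#include <vector>

// Toy model (an assumption, not the PTO runtime): two in-order pipes that
// synchronize through counting event flags, one flag per buffer and direction.
enum class Op { Gather, Read, SetFull, WaitFull, SetFree, WaitFree };
struct Instr { Op op; int buf; int chunk; };

struct SimResult {
    bool overwroteUnreadChunk = false;  // WAR hazard observed
    bool deadlocked = false;            // a wait could never be satisfied
    std::array<int, 2> leftoverFree{};  // unconsumed MTE1 -> MTE2 flags at exit
};

SimResult simulate(int chunks, bool reverseSync) {
    std::deque<Instr> mte2, mte1;  // per-pipe instruction queues, in issue order
    if (reverseSync) mte1.push_back({Op::SetFree, 1, -1});  // buffer 1 starts empty
    mte2.push_back({Op::Gather, 0, 0});                     // prime chunk 0
    mte2.push_back({Op::SetFull, 0, 0});
    for (int j = 0; j < chunks; ++j) {
        const int buf = j & 1;
        if (j + 1 < chunks) {
            const int nb = (j + 1) & 1;
            if (reverseSync) mte2.push_back({Op::WaitFree, nb, j + 1});
            mte2.push_back({Op::Gather, nb, j + 1});
            mte2.push_back({Op::SetFull, nb, j + 1});
        }
        mte1.push_back({Op::WaitFull, buf, j});
        mte1.push_back({Op::Read, buf, j});
        if (reverseSync) mte1.push_back({Op::SetFree, buf, j});
    }

    SimResult result;
    std::array<int, 2> full{}, freeFlag{};  // pending flag counts per buffer
    std::array<int, 2> held{-1, -1};        // chunk currently stored in each buffer
    std::vector<bool> wasRead(chunks, false);

    // Executes one instruction; returns false if it is a wait that must block.
    auto step = [&](const Instr& in) {
        switch (in.op) {
            case Op::Gather:
                if (held[in.buf] >= 0 && !wasRead[held[in.buf]]) result.overwroteUnreadChunk = true;
                held[in.buf] = in.chunk;
                return true;
            case Op::Read:     wasRead[held[in.buf]] = true; return true;
            case Op::SetFull:  ++full[in.buf]; return true;
            case Op::SetFree:  ++freeFlag[in.buf]; return true;
            case Op::WaitFull: if (full[in.buf] == 0) return false; --full[in.buf]; return true;
            case Op::WaitFree: if (freeFlag[in.buf] == 0) return false; --freeFlag[in.buf]; return true;
        }
        return true;
    };

    while (!mte2.empty() || !mte1.empty()) {
        bool progressed = false;
        // Adversarial schedule: let the gather pipe run ahead as far as flags allow.
        while (!mte2.empty() && step(mte2.front())) { mte2.pop_front(); progressed = true; }
        if (!mte1.empty() && step(mte1.front())) { mte1.pop_front(); progressed = true; }
        if (!progressed) { result.deadlocked = true; break; }
    }
    result.leftoverFree = freeFlag;
    return result;
}
```

In this model, `simulate(4, false)` reports an overwrite of an unread chunk, while `simulate(4, true)` does not and does not deadlock. The model also ends the suggested sequence with one unconsumed MTE1 -> MTE2 flag per buffer, because the last two iterations set a flag that no later gather waits on. Whether the real kernel needs matching `wait_flag` calls after the loop depends on code outside the quoted hunk and on the runtime's flag discipline, so that point is worth checking separately.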

Add a hardware-instruction-stream simulation pseudo-kernel under kernels/manual/a2a3/distributed_ffn_grid.

The intent is to first provide a pseudo-code example that faithfully mirrors the real hardware instruction stream (data movement / compute / collective), using real PTO ISA names, signatures, tile types and data flow throughout. It models a unified single-core processing element (the matmul unit and vector unit share one L1): the mixed Cube/Vec A2/A3 kernel is rewritten into one straight-line instruction stream where the cube L1 and vector UB are fused into a single L1 address space, with no __DAV_CUBE__/__DAV_VEC__ split.

Two memory domains are kept distinct:
- Batcher mem: host I/O only. It streams input X into each cell's L1 (TLOAD) and collects the final [T, Hc] result back out (TSTORE); weights load DRAM -> L1 via TLOAD.
- NoC-interconnected L1: the row-local hidden AllGather moves L1<->L1 across cells over the on-chip NoC and never round-trips Batcher mem. Each cell publishes its [T, Fi] shard into a NoC-visible L1 SRAM window (TSTORE) and gathers peers' shards over the NoC.

The Acc->Vec accumulator drain uses the ISA's only legal path, the C2V fixpipe, spelled as the directional TMOV(pipe, acc)/TMOV(vec, pipe).

Adds an in-file comm::TAllGather that concatenates along the feature axis (rank r at feature offset r*shardCols) and gathers one fine-grained [T, chunkCols] band per call. The single [T, F] gather is split into FFN_AG_CHUNKS=4 bands; each gathered band is consumed by an accumulating down GEMM over its K-band, with the next band's gather (MTE2) issued before the current band's GEMM (M) so the collective hides under compute.

SCOPE: high-fidelity pseudo-code, not a buildable dav-c220 object. The unified stream deliberately keeps TMATMUL and vlrelu/vmul/vconv side by side, so it does not compile for dav-c220 (the sub-cores have disjoint instruction sets). That is by design: the goal is a faithful real hardware instruction stream, not an A3 binary.

gemini-code-assist Bot reviewed Jul 1, 2026

chenshengxin2026 force-pushed the mock/distributed-ffn-grid-allgather-single-stream branch from 5a4c0b0 to 33b831c on July 1, 2026 at 08:31

chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from chenshengxin2026:mock/distributed-ffn-grid-allgather-single-stream
