[WIP] mock: high-fidelity single-stream AllGather FFN pseudo-kernel #187

Open
chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from chenshengxin2026:mock/distributed-ffn-grid-allgather-single-stream

@chenshengxin2026 chenshengxin2026 commented Jul 1, 2026

What

Adds a hardware-instruction-stream simulation pseudo-kernel under
kernels/manual/a2a3/distributed_ffn_grid/mock-distributed_ffn_grid_allgather_compute_kernel.cpp.

The core goal of this WIP is to first provide a pseudo-code example that faithfully mirrors the real hardware instruction stream — data movement, compute, and collective communication — using real PTO ISA instruction names, signatures, tile types, and data flow throughout.

Design summary

It rewrites the original mixed Cube/Vec A2/A3 kernel as high-fidelity pseudo-code with a single L1 and a single instruction stream, modeling one unified single-core processing element in which the matmul unit and vector unit share one L1:

  • No __DAV_CUBE__/__DAV_VEC__ split — one straight-line stream; the cube L1 and vector UB are fused into a single L1 address space (only kL1*, no kUb*).
  • Two memory domains kept distinct:
    • Batcher mem — host I/O only: streams input X into each cell's L1 (TLOAD) and collects the final [T, Hc] result back out (TSTORE). Weights load DRAM → L1 via TLOAD.
    • NoC-interconnected L1 — the row-local hidden AllGather moves L1↔L1 across cells over the on-chip NoC and never round-trips Batcher mem. Each cell publishes its [T, Fi] hidden shard into a NoC-visible L1 SRAM window (TSTORE) and gathers peers' shards over the NoC.
  • Acc→Vec drain uses the ISA's only legal path — the C2V fixpipe — spelled as the directional TMOV(pipe, acc) / TMOV(vec, pipe).

comm::TAllGather + fine-grained overlap

Adds an in-file comm::TAllGather (signature aligned with the library collectives TGATHER/TREDUCE) that:

  • concatenates along the feature axis (rank r at feature offset r*shardCols), folding in the re-layout that TGATHER's row-stacking would otherwise force; and
  • gathers one fine-grained [T, chunkCols] band per call, TLOAD-ed over the NoC straight from the owning peer cell's L1 SRAM window into a matmul L1 tile (static shape, satisfying ND2NZ).
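The feature-axis concatenation described above can be illustrated with a small host-side reference model. This is only a sketch under stated assumptions: `Shard`, `gatherBand`, and the row-major layout are hypothetical names and choices made for illustration, and none of it is the PTO `comm::TAllGather` API or its NoC data path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model only (illustrative assumption, not PTO code).
// One rank's hidden shard, stored row-major as [rows, shardCols].
using Shard = std::vector<float>;

// Copies columns [featOffset, featOffset + chunkCols) of the logical
// concatenated [rows, ranks * shardCols] matrix into a dense
// [rows, chunkCols] band, where rank r owns columns starting at r * shardCols.
std::vector<float> gatherBand(const std::vector<Shard>& shards, std::size_t rows,
                              std::size_t shardCols, std::size_t featOffset,
                              std::size_t chunkCols) {
    assert(featOffset + chunkCols <= shards.size() * shardCols);
    std::vector<float> band(rows * chunkCols);
    for (std::size_t t = 0; t < rows; ++t) {
        for (std::size_t c = 0; c < chunkCols; ++c) {
            const std::size_t feat = featOffset + c;     // column in the gathered matrix
            const std::size_t rank = feat / shardCols;   // peer that owns this column
            const std::size_t local = feat % shardCols;  // column inside that peer's shard
            band[t * chunkCols + c] = shards[rank][t * shardCols + local];
        }
    }
    return band;
}
```

Because each output row already holds the peers' columns side by side, no separate re-layout pass is needed after the gather, which is the property the first bullet attributes to concatenating along the feature axis instead of stacking rows.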

The single [T, F] AllGather is split into FFN_AG_CHUNKS=4 bands. Each gathered band is consumed by an accumulating down GEMM (TMATMUL/TMATMUL_ACC) over its K-band. The next band's gather (MTE2) is issued before the current band's GEMM (M), so the collective hides under compute (double-buffered on EVENT_ID0/EVENT_ID1).
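The banded down GEMM relies on the identity that a product over the full K axis equals the sum of products over disjoint K-bands. The host-side sketch below checks that arithmetic only. `downGemmChunked` and its parameters are hypothetical names, the first band overwrites the accumulator while later bands add to it (an assumption about how a non-accumulating and an accumulating matmul would be paired), and nothing here models pipes, tiles, or overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model only (illustrative assumption, not PTO code).
// hidden is [rows, k] and wDown is [k, cols], both row-major.
// The K axis is split into `chunks` equal bands; band 0 overwrites the
// accumulator and every later band adds its partial product to it.
std::vector<float> downGemmChunked(const std::vector<float>& hidden,
                                   const std::vector<float>& wDown,
                                   std::size_t rows, std::size_t k, std::size_t cols,
                                   std::size_t chunks) {
    assert(chunks > 0 && k % chunks == 0);
    assert(hidden.size() == rows * k && wDown.size() == k * cols);
    const std::size_t kBand = k / chunks;
    std::vector<float> acc(rows * cols, 0.0f);
    for (std::size_t j = 0; j < chunks; ++j) {
        const std::size_t k0 = j * kBand;  // first K index of band j
        for (std::size_t t = 0; t < rows; ++t) {
            for (std::size_t n = 0; n < cols; ++n) {
                float partial = 0.0f;
                for (std::size_t kk = 0; kk < kBand; ++kk) {
                    partial += hidden[t * k + k0 + kk] * wDown[(k0 + kk) * cols + n];
                }
                float& out = acc[t * cols + n];
                out = (j == 0) ? partial : out + partial;
            }
        }
    }
    return acc;
}
```

Calling it with `chunks = 1` gives the unsplit product, so comparing that against `chunks = 4` on small integer-valued inputs is a quick way to confirm the banding leaves the result unchanged. With general floating-point data the two can differ in the last bits because the summation order changes.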

Instruction-stream overview

| Stage | Instruction | Direction |
|---|---|---|
| Load input X | TLOAD | Batcher mem → L1 |
| Load weights | TLOAD | DRAM → L1 |
| gate/up feed | TMOV | L1 → L0A/L0B |
| gate/up matmul | TMATMUL | |
| Acc → vector | TMOV(pipe,acc) / TMOV(vec,pipe) | L0C → L1 fixpipe |
| PReLU·mul·cvt | TLRELU/TMUL/TCVT | on L1 |
| Publish hidden shard | TSTORE | L1 → NoC-visible L1 SRAM window |
| Arrival barrier | TNOTIFY/TWAIT | on-chip NoC barrier |
| Gather peer shards (×4) | comm::TAllGather | peer L1 SRAM (NoC) → L1 |
| down GEMM (×4 accumulate) | TMOV/TEXTRACT + TMATMUL_ACC | |
| Drain down accumulator | TMOV | L0C → L1 |
| Write output shard | TSTORE | L1 → Batcher mem |

Scope / notes

  • This is high-fidelity pseudo-code, not a buildable dav-c220 object. The unified stream deliberately keeps TMATMUL and vlrelu/vmul/vconv side by side, so it does not compile for dav-c220 (Ascend 910B): the cube (AIC) and vector (AIV) sub-cores have disjoint instruction sets, and each is compiled separately. That is by design: the goal is a faithful real-hardware instruction stream, not an A3 binary.
  • Kernel signature / launch is unchanged (drop-in, 15 params); some GM staging params are intentionally left unused ((void)) since intermediates now reside in L1.
  • WIP: this is an initial pseudo-code reference for review of the instruction-stream modeling; not yet wired into build/test.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a high-fidelity pseudo-kernel for a single-PE FFN cell with an AllGather split variant, utilizing a unified L1 address space to overlap communication and computation. The review feedback identifies a Write-After-Read (WAR) hazard in the double-buffering pipeline of hiddenChunkMat where asynchronous TAllGather operations can overwrite buffers before they are read by TMOV. A synchronization mechanism using PIPE_MTE1 -> PIPE_MTE2 flags is suggested to resolve this issue.

Comment on lines +492 to +523
    // Prime: gather chunk 0 into buffer 0.
    comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[0], /*featOffset=*/0, validChunk, validN);
#ifndef __PTO_AUTO__
    set_flag(PIPE_MTE2, PIPE_MTE1, EVENT_ID0);
#endif

    for (int j = 0; j < FFN_AG_CHUNKS; ++j) {
        const int buf = j & 1;
        const event_t bufEvt = (buf == 0) ? EVENT_ID0 : EVENT_ID1;

        // Launch the next chunk's gather so its TLOAD (MTE2) overlaps this
        // chunk's GEMM (M) below.
        if (j + 1 < FFN_AG_CHUNKS) {
            const int nb = (j + 1) & 1;
            const event_t nbEvt = (nb == 0) ? EVENT_ID0 : EVENT_ID1;
            comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[nb], (j + 1) * validChunk, validChunk,
                                                    validN);
#ifndef __PTO_AUTO__
            set_flag(PIPE_MTE2, PIPE_MTE1, nbEvt);
#endif
        }

#ifndef __PTO_AUTO__
        wait_flag(PIPE_MTE2, PIPE_MTE1, bufEvt);
#endif
        TMOV(aDownT, hiddenChunkMat[buf]);             // hidden K-band  L1 -> L0A
        TEXTRACT(bDownT, wDownMat, j * validChunk, 0); // W_down K-band  L1 -> L0B

#ifndef __PTO_AUTO__
        set_flag(PIPE_MTE1, PIPE_M, bufEvt);
        wait_flag(PIPE_MTE1, PIPE_M, bufEvt);
#endif

high

The current double-buffering implementation for hiddenChunkMat has a Write-After-Read (WAR) hazard. While PIPE_MTE2 -> PIPE_MTE1 synchronization is present (ensuring that TAllGather completes before TMOV reads the buffer), there is no reverse PIPE_MTE1 -> PIPE_MTE2 synchronization.

Without this reverse synchronization, the asynchronous TAllGather (MTE2) issued in iteration j + 1 targets hiddenChunkMat[(j + 2) & 1], which is the same buffer that the TMOV (MTE1) in iteration j reads, so it can overwrite that buffer before the read has finished.

To resolve this:

  1. Pre-set EVENT_ID1 on PIPE_MTE1 -> PIPE_MTE2 before the loop (since buffer 1 is initially empty and safe to write to).
  2. Before launching TAllGather for the next chunk, wait for the corresponding event on PIPE_MTE1 -> PIPE_MTE2.
  3. After TMOV finishes reading hiddenChunkMat[buf], set the corresponding event on PIPE_MTE1 -> PIPE_MTE2 to signal that the buffer is safe to reuse.
    // Prime: gather chunk 0 into buffer 0.
#ifndef __PTO_AUTO__
    set_flag(PIPE_MTE1, PIPE_MTE2, EVENT_ID1);
#endif
    comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[0], /*featOffset=*/0, validChunk, validN);
#ifndef __PTO_AUTO__
    set_flag(PIPE_MTE2, PIPE_MTE1, EVENT_ID0);
#endif

    for (int j = 0; j < FFN_AG_CHUNKS; ++j) {
        const int buf = j & 1;
        const event_t bufEvt = (buf == 0) ? EVENT_ID0 : EVENT_ID1;

        // Launch the next chunk's gather so its TLOAD (MTE2) overlaps this
        // chunk's GEMM (M) below.
        if (j + 1 < FFN_AG_CHUNKS) {
            const int nb = (j + 1) & 1;
            const event_t nbEvt = (nb == 0) ? EVENT_ID0 : EVENT_ID1;
#ifndef __PTO_AUTO__
            wait_flag(PIPE_MTE1, PIPE_MTE2, nbEvt);
#endif
            comm::TAllGather<comm::CollEngine::AIV>(rowGroup, hiddenChunkMat[nb], (j + 1) * validChunk, validChunk,
                                                    validN);
#ifndef __PTO_AUTO__
            set_flag(PIPE_MTE2, PIPE_MTE1, nbEvt);
#endif
        }

#ifndef __PTO_AUTO__
        wait_flag(PIPE_MTE2, PIPE_MTE1, bufEvt);
#endif
        TMOV(aDownT, hiddenChunkMat[buf]);             // hidden K-band  L1 -> L0A
        TEXTRACT(bDownT, wDownMat, j * validChunk, 0); // W_down K-band  L1 -> L0B
#ifndef __PTO_AUTO__
        set_flag(PIPE_MTE1, PIPE_MTE2, bufEvt);
#endif

#ifndef __PTO_AUTO__
        set_flag(PIPE_MTE1, PIPE_M, bufEvt);
        wait_flag(PIPE_MTE1, PIPE_M, bufEvt);
#endif
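The flag handshake in the suggestion above can be traced with a small host-side bookkeeping model. This is a sketch under stated assumptions, not PTO code: `HandshakeReport` and `traceHandshake` are hypothetical names, flags are treated as simple counters, every wait is required to follow a matching set in program order (a conservative simplification), and timing is not modeled at all.

```cpp
#include <array>
#include <cassert>

// Host-side bookkeeping model only (illustrative assumption, not PTO code).
struct HandshakeReport {
    bool everyWaitMatched = true;      // false if some wait had no earlier set
    std::array<int, 2> pendingFree{};  // MTE1 -> MTE2 "buffer free" flags left set
    std::array<int, 2> pendingFull{};  // MTE2 -> MTE1 "buffer full" flags left set
};

// Replays the set/wait order of the suggested loop for `chunks` bands.
// presetBuffer1 models step 1 (pre-setting EVENT_ID1 before the loop).
HandshakeReport traceHandshake(int chunks, bool presetBuffer1) {
    assert(chunks >= 1);
    HandshakeReport r;
    auto wait = [&r](std::array<int, 2>& flags, int evt) {
        if (flags[evt] == 0) {
            r.everyWaitMatched = false;  // nothing to consume: wait is unmatched
        } else {
            --flags[evt];
        }
    };
    if (presetBuffer1) {
        ++r.pendingFree[1];  // step 1: buffer 1 starts empty and writable
    }
    ++r.pendingFull[0];      // prime: chunk 0 gathered into buffer 0
    for (int j = 0; j < chunks; ++j) {
        const int buf = j & 1;
        if (j + 1 < chunks) {
            const int nb = (j + 1) & 1;
            wait(r.pendingFree, nb);  // step 2: buffer nb must have been released
            ++r.pendingFull[nb];      // gather of chunk j + 1 fills buffer nb
        }
        wait(r.pendingFull, buf);     // the read needs a filled buffer
        ++r.pendingFree[buf];         // step 3: read done, buffer may be reused
    }
    return r;
}
```

For `chunks = 4` with the pre-set, every wait is matched; without the pre-set, the very first reverse wait on buffer 1 has nothing to consume. The trace also ends with one MTE1 → MTE2 flag still set per event ID. Whether those must be drained by trailing `wait_flag` calls depends on the hardware's flag semantics, which this thread does not state.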

@chenshengxin2026 chenshengxin2026 force-pushed the mock/distributed-ffn-grid-allgather-single-stream branch from 5a4c0b0 to 33b831c Compare July 1, 2026 08:31
Add a hardware-instruction-stream simulation pseudo-kernel under
kernels/manual/a2a3/distributed_ffn_grid. The intent is to first provide
a pseudo-code example that faithfully mirrors the real hardware
instruction stream (data movement / compute / collective), using real
PTO ISA names, signatures, tile types and data flow throughout.

It models a unified single-core processing element (the matmul unit and
vector unit share one L1): the mixed Cube/Vec A2/A3 kernel is rewritten
into one straight-line instruction stream where the cube L1 and vector
UB are fused into a single L1 address space, with no
__DAV_CUBE__/__DAV_VEC__ split. Two memory domains are kept distinct:

- Batcher mem — host I/O only: streams input X into each cell's L1
  (TLOAD) and collects the final [T, Hc] result back out (TSTORE);
  weights load DRAM -> L1 via TLOAD.
- NoC-interconnected L1 — the row-local hidden AllGather moves L1<->L1
  across cells over the on-chip NoC and never round-trips Batcher mem:
  each cell publishes its [T, Fi] shard into a NoC-visible L1 SRAM
  window (TSTORE) and gathers peers' shards over the NoC.

The Acc->Vec accumulator drain uses the ISA's only legal path, the C2V
fixpipe, spelled as the directional TMOV(pipe, acc)/TMOV(vec, pipe).

Adds an in-file comm::TAllGather that concatenates along the feature
axis (rank r at feature offset r*shardCols) and gathers one fine-grained
[T, chunkCols] band per call. The single [T, F] gather is split into
FFN_AG_CHUNKS=4 bands; each gathered band is consumed by an accumulating
down GEMM over its K-band, with the next band's gather (MTE2) issued
before the current band's GEMM (M) so the collective hides under compute.

SCOPE: high-fidelity pseudo-code, not a buildable dav-c220 object. The
unified stream deliberately keeps TMATMUL and vlrelu/vmul/vconv side by
side, so it does not compile for dav-c220 (the sub-cores have disjoint
instruction sets). That is by design — the goal is a faithful real
hardware instruction stream, not an A3 binary.