Skip to content

update(distributed): convert all_to_all distributed to push-based lowering#1267

Open
georgebisbas wants to merge 3 commits into
hw-native-sys:mainfrom
georgebisbas:perf/all-to-all-push-based
Open

update(distributed): convert all_to_all distributed to push-based lowering#1267
georgebisbas wants to merge 3 commits into
hw-native-sys:mainfrom
georgebisbas:perf/all-to-all-push-based

Conversation

@georgebisbas

@georgebisbas georgebisbas commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Replace the 3-phase pull-based all_to_all kernel (stage-in → barrier → remote TLOAD) with a 2-phase push-based kernel (TPUT to peers → barrier → local copy-out), mirroring the push-based approach proposed in pypto#1937.

Benefits

  • Fewer phases: 2 kernel phases instead of 3 (drops the separate stage-in loop).
  • No remote reads: Phase 3 is purely local TLOAD/TSTORE — no CommRemotePtr cross-rank reads needed.
  • Fewer tiles: Single pushTile reused for both push and copy-out (was two tiles: stageTile + recvTile).
  • No post-exchange barrier: Push model has no WAR hazard — input is read-only, scratch is write-only from peers.
  • Simpler code: The kernel is 26 lines shorter and the data flow is more linear.

What changed

File Change
kernels/aiv/all_to_all_kernel.cpp Rewrite: 3-phase pull → 2-phase push using pto::comm::TPUT
kernels/orchestration/all_to_all_orch.cpp Updated comments
main.py Updated docstring and comments
all_to_all_distributed/README.md Updated algorithm description
l3/README.md Updated row description

How it works

Phase 1 (push):  for dest in 0..NR-1:
                   TPUT input[dest*C..] → peer dest's scratch[my_rank*C..]
                 Self-rank: CommRemotePtr to self → local pointer → TPUT is local write
Phase 2 (barrier): TNOTIFY all, TWAIT all (unchanged)
Phase 3 (copy-out): for src in 0..NR-1:
                      TLOAD(local scratch[src*C..]) → TSTORE(output[src*C..])
                    Purely local — scratch already holds the result

The push primitive pto::comm::TPUT is already proven in ep_dispatch_combine/kernels/aiv/combine.cpp. CPU_SIM path uses a manual element loop since TPUT is not available on the simulator.

Verification

  • Pre-commit: all 13 hooks pass (check-headers, cpplint, clang-format, ruff, pyright, markdownlint, etc.)
  • Sim P=2: max |out - expected| = 0.000e+00
  • Sim P=4: all ranks matched golden ✅
  • pytest: test_all_to_all_distributed[2] + [4] PASS
  • Hardware P=2, P=4 (pending)

References

  • pypto#1937: push-based all_to_all proposal by @YunjiQin
  • ep_dispatch_combine/kernels/aiv/combine.cpp: existing TPUT usage pattern

Replace the 3-phase pull-based all_to_all kernel (stage-in →
barrier → remote TLOAD) with a 2-phase push-based kernel (TPUT to
peers → barrier → local copy-out).

- Phase 1: TPUT each input chunk directly to the corresponding peer's
  scratch at offset my_rank*C via pto::comm::TPUT. Self-rank handled
  naturally — CommRemotePtr to self returns the local pointer.
- Phase 2: unchanged notify/wait barrier.
- Phase 3: purely local TLOAD from scratch → TSTORE to output
  (no CommRemotePtr needed here — scratch already holds the result).
- Single pushTile reused for both push and copy-out phases (was two
  tiles: stageTile + recvTile). No post-exchange barrier needed
  (no WAR hazard — input read-only, scratch write-only from peers).
- CPU_SIM fallback: manual element loop instead of TPUT.

Scratch naming preserved for consistency with all other L3 examples
(allreduce, allgather, broadcast, reduce_scatter).

Sim-verified: P=2 and P=4 golden checks pass (0.000e+00 max diff).
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e8a5add3-afe3-49d0-9750-a3c428c4774a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The all-to-all distributed example is refactored from a symmetric 3-phase stage-in/exchange pattern to a symmetric 2-phase push-based pattern. The kernel now performs direct TPUT pushes into peer scratch, a barrier via signal notify/wait, and local copy-out. Documentation and comments across README files, orchestration shim, and main.py are updated accordingly.

Changes

2-Phase Push Refactor

Layer / File(s) Summary
Kernel push and copy-out implementation
examples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cpp
Phase 1 replaces staged TLOAD/TSTORE with event flags by direct per-destination TPUT pushes via CommRemotePtr into peer scratch; Phase 3 replaces remote-read exchange with local copy-out from scratch using a reused pushTile.
Orchestration and documentation updates
examples/workers/l3/all_to_all_distributed/kernels/orchestration/all_to_all_orch.cpp, examples/workers/l3/all_to_all_distributed/README.md, examples/workers/l3/README.md, examples/workers/l3/all_to_all_distributed/main.py
Comments and documentation are updated to describe the push-based 2-phase (push, barrier, copy-out) pattern in place of the prior 3-phase mesh/staging description, including scratch buffer semantics.

Estimated code review effort: 2 (Simple) | ~12 minutes

Possibly related PRs

  • hw-native-sys/simpler#888: Prior implementation of the same all-to-all kernel and orchestration codepaths using the 3-phase stage/exchange approach that this PR replaces.

Poem

A rabbit push, no need to wait,
Straight to your burrow, chunk by chunk, direct fate. 🐇
Barrier bells ring, all ranks in tune,
Then copy-out home beneath the moon.
Two phases now where three once stood—
Simpler warrens, understood!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly summarizes the main change from pull-based to push-based all_to_all lowering.
Description check ✅ Passed The description is directly related to the changeset and accurately describes the push-based all_to_all rewrite.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the distributed all-to-all example to use a more efficient 2-phase push-based algorithm (push, barrier, and local copy-out) instead of the previous 3-phase pattern. Feedback on these changes includes fixing a markdown formatting typo in the README and optimizing the kernel loop by hoisting a loop-invariant pointer calculation outside of the loop.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread examples/workers/l3/all_to_all_distributed/README.md Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/workers/l3/all_to_all_distributed/README.md (1)

39-43: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Malformed code fence and stale kernel description.

Line 39 concatenates the fence language tag with content: textpush, barrier, copy-out instead of ```` text ```` on its own line — this breaks the code fence/language identifier. It looks like push, barrier, copy-out was intended to replace the stale `(stage-in, barrier, exchange)` description of the kernel on Line 43, which still describes the old 3-phase design.

📝 Proposed fix
-```textpush, barrier, copy-out
+```text
 all_to_all_distributed/
 ├── kernels/
 │   ├── aiv/
-│   │   └── all_to_all_kernel.cpp     # AIV kernel (stage-in, barrier, exchange)
+│   │   └── all_to_all_kernel.cpp     # AIV kernel (push, barrier, copy-out)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/workers/l3/all_to_all_distributed/README.md` around lines 39 - 43,
Fix the malformed fenced block in the all_to_all_distributed README and update
the stale kernel phase description. In the README’s tree snippet, separate the
code fence language tag from the content so the fence starts correctly as a
standalone text block, and update the all_to_all_kernel.cpp comment to use the
current “push, barrier, copy-out” wording instead of the old “stage-in, barrier,
exchange” description. Use the README tree section and the all_to_all_kernel.cpp
entry as the anchors when editing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@examples/workers/l3/all_to_all_distributed/README.md`:
- Around line 39-43: Fix the malformed fenced block in the
all_to_all_distributed README and update the stale kernel phase description. In
the README’s tree snippet, separate the code fence language tag from the content
so the fence starts correctly as a standalone text block, and update the
all_to_all_kernel.cpp comment to use the current “push, barrier, copy-out”
wording instead of the old “stage-in, barrier, exchange” description. Use the
README tree section and the all_to_all_kernel.cpp entry as the anchors when
editing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e8bf3e22-0c52-406a-b6ea-86964d314f96

📥 Commits

Reviewing files that changed from the base of the PR and between fb65fe9 and 301fcde.

📒 Files selected for processing (5)
  • examples/workers/l3/README.md
  • examples/workers/l3/all_to_all_distributed/README.md
  • examples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cpp
  • examples/workers/l3/all_to_all_distributed/kernels/orchestration/all_to_all_orch.cpp
  • examples/workers/l3/all_to_all_distributed/main.py

The pto-isa CPU simulator now has native TPUT support (TPut.hpp, via
Copy_Data wrapper). The manual element-loop fallbacks guarded by
__CPU_SIM are no longer needed — pto::comm::TPUT works on both
hardware and simulator paths.

- all_to_all_kernel.cpp: 1 guard removed
- combine.cpp: 1 guard removed
- dispatch.cpp: 3 guards removed
- kernel_allreduce_sum.cpp: 1 guard removed

Sim-verified: all_to_all (P=2, P=4), ffn_tp_parallel all pass.
ep_dispatch_combine golden check is a pre-existing failure (fails on
main too).
@georgebisbas georgebisbas changed the title Update: convert all_to_all distributed to push-based lowering update(distributed): convert all_to_all distributed to push-based lowering Jul 3, 2026
- Fix markdown code block language specifier typo in README.md
  (textpush,barrier,copy-out -> text)
- Hoist loop-invariant scratch_dst_local pointer outside the
  Phase 1 loop (gemini-code-assist review)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant