add non_all_reduce version of cluster reduction #5319

liqiangxl · 2025-10-03T15:02:26Z

add non_all_reduce version of cluster reduction

Usage: The second reduction in cross entropy loss, which computes the log-sum-exp, doesn't require an all-reduce.
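To make the usage concrete, here is a minimal host-side sketch of a per-row cross entropy loss. It assumes the usual numerically stable formulation, in which the first reduction is a row max and the second is the sum of shifted exponentials; `crossEntropyRow` is a hypothetical helper for illustration, not code from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not nvFuser code): cross entropy loss of one row of
// logits against a target class, using the stable log-sum-exp form.
inline double crossEntropyRow(const std::vector<double>& logits, int target) {
  assert(!logits.empty());
  assert(target >= 0 && target < static_cast<int>(logits.size()));

  // Reduction 1: row max. Every element consumes it below (x - row_max),
  // so in a multi-block kernel every block needs the value (all-reduce).
  const double row_max = *std::max_element(logits.begin(), logits.end());

  // Reduction 2: sum of shifted exponentials. It is consumed once per row,
  // so one owner holding the result is enough (plain reduce).
  double sum_exp = 0.0;
  for (double x : logits) {
    sum_exp += std::exp(x - row_max);
  }

  // loss = logsumexp(logits) - logits[target]
  return row_max + std::log(sum_exp) - logits[target];
}
```

Because `sum_exp` feeds a single scalar per row, nothing is gained by broadcasting it back to every block, which is the case the non-all-reduce path targets.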
Algorithm:
(1) All warps perform a warp reduction and write their results to the last block's shared memory.
(2) The last block waits until all data has been received, then its warp 0 finishes the reduction.
(3) After the reduction, warp 0 in the last block of the cluster holds the valid result.
The last block is used instead of the first block to stay consistent with grid reduction.
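The three steps can be modeled on the host as below. This is only a sketch of the data flow under stated assumptions: blocks and warps are plain loop indices, the last block's shared memory is a `std::vector`, the barrier is a simple arrival counter, and the reduction is a sum. `simulateClusterReduce` and `ClusterReduceResult` are hypothetical names, not the kernel in `cluster.cu`.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

constexpr int kWarpSize = 32;

// Hypothetical result type: which block owns the value, and the value itself.
struct ClusterReduceResult {
  int owner_block;
  double value;
};

// Host-side model of the non-all-reduce cluster reduction (sum).
// `input` holds one value per thread, laid out as [block][warp][lane].
inline ClusterReduceResult simulateClusterReduce(
    const std::vector<double>& input, int cluster_size, int warps_per_block) {
  assert(cluster_size > 0 && warps_per_block > 0);
  const int total_warps = cluster_size * warps_per_block;
  assert(static_cast<int>(input.size()) == total_warps * kWarpSize);

  const int last_block = cluster_size - 1;
  // Stand-in for the last block's shared memory: one slot per warp in the cluster.
  std::vector<double> last_block_smem(total_warps, 0.0);
  int arrived = 0;  // stand-in for the barrier's arrival count

  // Step 1: every warp reduces its lanes and writes to the last block's buffer.
  for (int block = 0; block < cluster_size; ++block) {
    for (int warp = 0; warp < warps_per_block; ++warp) {
      const int global_warp = block * warps_per_block + warp;
      const auto first = input.begin() + global_warp * kWarpSize;
      last_block_smem[global_warp] =
          std::accumulate(first, first + kWarpSize, 0.0);
      ++arrived;
    }
  }

  // Step 2: the last block waits for all partials, then warp 0 finishes.
  assert(arrived == total_warps);
  const double result =
      std::accumulate(last_block_smem.begin(), last_block_smem.end(), 0.0);

  // Step 3: only the last block holds the valid result.
  return {last_block, result};
}
```

For example, a cluster of 2 blocks with 4 warps each and every thread contributing 1.0 yields 256.0, owned by block 1.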

github-actions · 2025-10-03T15:04:57Z

Review updated until commit 5c6dd06

Description

Add non-all-reduce cluster reduction support
Update validation logic for scalar outputs
Extend tests for both all-reduce and reduce
Refactor cluster reduction kernel for flexibility

Changes walkthrough 📝

Relevant files

Enhancement

4 files

codegen.cpp `Pass is_all_reduce flag to template args`	+2/-0
cluster.cu `Unified clusterReduce with is_all_reduce`	+132/-35
cluster_test_kernels.cu `Template kernel for all/reduce modes`	+45/-15
cluster_test_helper.h `Update launch function signature`	+4/-2

Bug fix

2 files

index.cpp `Remove all-reduce restriction in lowering`	+1/-5
kernel_ir.cpp `Remove is_all_reduce assertion constraint`	+0/-1

Tests

3 files

cluster_test_helper.cpp `Add is_all_reduce support in validation`	+32/-14
test_cluster_device_func.cpp `Add non-all-reduce test cases`	+51/-6
test_cluster.cpp `Add SimpleFusionNotAllReduce test`	+51/-1

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Barrier Setup
The setupBarrierExpectTX function is called conditionally in the reduce path based on my_block_rank, but only by warp 0. This could lead to inconsistent barrier initialization if not all participating blocks properly set up the barrier.

    setupBarrierExpectTX<cluster_size, warps_per_block, T>(
        barrier_smem_addr, warp_idx);
    }

Output Handling
In the clusterReduceTestKernel, when is_all_reduce is false, only the first thread of the last block writes to output[0]. This assumes that the output tensor has at least one element, but this is not validated.

    output[0] = result;

liqiangxl · 2025-10-03T15:04:58Z

!test

liqiangxl · 2025-10-03T16:27:25Z

!test

liqiangxl · 2025-10-03T18:25:05Z

!test

liqiangxl · 2025-10-05T13:38:18Z

!test

naoyam · 2025-10-06T17:30:54Z

runtime/cluster.cu

+template <
+    int CLUSTER_SIZE,
+    int WARPS_PER_BLOCK,
+    bool is_all_reduce,


nit: I know the runtime functions don't follow the style guide very well, but this just looks inconsistent as the first two are all caps. Looks like the guide suggests to use the same naming as normal parameters, so we should probably use cluster_size and warps_per_block.
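As an illustration of the suggested naming, a template parameter list with all three parameters in lower case might look like the sketch below. The function `blocksHoldingResult` is hypothetical and exists only to show the parameter style and a compile-time branch on `is_all_reduce`; it is not the signature in `runtime/cluster.cu`.

```cpp
#include <cassert>

// Hypothetical example: template parameters named like normal parameters.
// Returns how many blocks in the cluster hold a valid result afterwards.
template <int cluster_size, int warps_per_block, bool is_all_reduce>
constexpr int blocksHoldingResult() {
  static_assert(cluster_size > 0, "cluster_size must be positive");
  static_assert(warps_per_block > 0, "warps_per_block must be positive");
  if constexpr (is_all_reduce) {
    return cluster_size;  // all-reduce: every block ends up with the result
  } else {
    return 1;  // reduce: only one block (the last one) holds the result
  }
}
```

The sketch needs C++17 for `if constexpr`.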

naoyam · 2025-10-06T17:37:54Z

Can you make sure a proper predicate is generated for the output of the reduction? We need to generate a predicate that masks off all blocks except the last one. There's logic for grid reductions, and I think that should just work since this also looks like a grid reduction.

liqiangxl · 2025-10-07T13:19:53Z

> Can you make sure a proper predicate is generated for the output of the reduction? We need to generate a predicate that masks off all blocks except the last one. There's logic for grid reductions, and I think that should just work since this also looks like a grid reduction.

Yes, the output is correctly predicated. Initially I used block 0 to finish the reduction; the tests caught the error, so I switched to the last block to reuse the existing predicate.
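A small host-side model shows why the block that finishes the reduction has to match the predicate. It assumes the write predicate passes only the last block, as described above; `isLastBlock` and `predicatedWrite` are hypothetical names, not the generated predicate code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a grid-reduction-style write predicate:
// only the last block is allowed to write the output.
inline bool isLastBlock(int block_rank, int num_blocks) {
  return block_rank == num_blocks - 1;
}

// per_block_result[b] is the value block b holds after the reduction.
// Only the block that finished the reduction holds a valid value.
inline double predicatedWrite(const std::vector<double>& per_block_result) {
  const int num_blocks = static_cast<int>(per_block_result.size());
  assert(num_blocks > 0);
  double output = 0.0;
  for (int b = 0; b < num_blocks; ++b) {
    if (isLastBlock(b, num_blocks)) {
      output = per_block_result[b];  // masked off for every other block
    }
  }
  return output;
}
```

If the last block finishes the reduction, the predicate writes the valid value. If block 0 finishes it instead, the same predicate writes whatever the last block happens to hold, which matches the failure the tests caught.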

naoyam

LGTM

liqiangxl · 2025-10-07T17:25:24Z

!test

add non_all_reduce version of cluster reduction

6ed1427

clean

7cad2f3

use last block

d8b5c0f

Merge branch 'main' into llu/add_non_all_reduce_cluster_reduction

379d472

liqiangxl marked this pull request as ready for review October 6, 2025 13:02

liqiangxl requested review from jacobhinkle and naoyam October 6, 2025 13:02

naoyam reviewed Oct 6, 2025

revise template para

535e4e2

naoyam approved these changes Oct 7, 2025

Merge branch 'main' into llu/add_non_all_reduce_cluster_reduction

5c6dd06

liqiangxl merged commit 7e66407 into main Oct 8, 2025
55 checks passed

liqiangxl deleted the llu/add_non_all_reduce_cluster_reduction branch October 8, 2025 12:01
