
Add scalar reduction codegen schedule #1284

Open

Yancey1989 wants to merge 8 commits into base: main
Conversation

@Yancey1989 (Collaborator) commented on Mar 1, 2024:

Add a scalar-reduction codegen template. The algorithm comes from https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.
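
For context, the following is a minimal CUDA sketch of the tree-based shared-memory reduction those slides describe; the kernel and variable names are illustrative, not taken from this PR.

// Illustrative sketch of the classic tree reduction from the NVIDIA slides.
__global__ void reduce_sum(const float* inputs, float* output, int n) {
  extern __shared__ float shm[];
  int tid = threadIdx.x;
  int block_size = blockDim.x;

  // Each thread accumulates a grid-strided partial sum into shared memory.
  float acc = 0.0f;
  for (int j = blockIdx.x * block_size + tid; j < n; j += gridDim.x * block_size)
    acc += inputs[j];
  shm[tid] = acc;
  __syncthreads();

  // Tree reduction: halve the number of active threads each iteration.
  for (int stride = block_size / 2; stride > 0; stride /= 2) {
    if (tid < stride) shm[tid] += shm[tid + stride];
    __syncthreads();
  }
  if (tid == 0) output[blockIdx.x] = shm[0];
}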

@Yancey1989 changed the title from "[WIP] support scalar reduction" to "support scalar reduction" on Mar 8, 2024

@eedalong (Collaborator) previously approved these changes on Mar 12, 2024:

LGTM

@Yancey1989 changed the title from "support scalar reduction" to "Add scalar reduction codegen schedule" on Mar 20, 2024
@eedalong self-requested a review on Mar 22, 2024
* shm[tid] += inputs[j] + inputs[j + block_size];
* }
* __syncthreads();
* for (int stride = block_size / 2; stride > 0; stride /= 2) {
Collaborator:

The warpReduce logic is missing here.
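
For reference, the warpReduce step from the same slides looks roughly like this (a sketch with illustrative names; the volatile qualifier is the pre-Volta idiom the slides use once only one warp is active):

// Illustrative sketch: unrolled final-warp reduction, per the NVIDIA slides.
// With <= 32 active threads (one warp), the slides drop __syncthreads()
// and rely on a volatile shared-memory pointer instead.
__device__ void warpReduce(volatile float* shm, int tid) {
  shm[tid] += shm[tid + 32];
  shm[tid] += shm[tid + 16];
  shm[tid] += shm[tid + 8];
  shm[tid] += shm[tid + 4];
  shm[tid] += shm[tid + 2];
  shm[tid] += shm[tid + 1];
}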

}
{
SmallVector<Value, 4> init_values = {};
for (int stride = 128; stride > 16; stride /= 2) {
Collaborator:

Since warp_size is 32, it is better to set the stop condition to stride > 32.
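
In CUDA terms, the suggested structure would be roughly the following (a sketch, assuming the 256-thread block implied by the initial stride of 128):

// Illustrative sketch: the barrier-synchronized loop stops once one warp remains.
for (int stride = 128; stride > 32; stride /= 2) {  // strides 128 and 64
  if (tid < stride) shm[tid] += shm[tid + stride];
  __syncthreads();
}
if (tid < 32) warpReduce(shm, tid);  // strides 32 down to 1, no block barrier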

b.create<memref::LoadOp>(loc, shared_mem_map[root_op], strid_tid);
Value sum = accum_factory[idx](shm_val_1, shm_val_2);
b.create<memref::StoreOp>(loc, sum, shared_mem_map[root_op], tid);
b.create<gpu::BarrierOp>(loc);
Collaborator:

The BarrierOp is not necessary; threads within a warp are always synchronized.

Yancey1989 (Author):

I rewrote the warp-reduction section with shuffle instructions and will update this PR later.
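
A shuffle-based warp reduction typically looks like the following sketch (assuming CUDA 9+ for __shfl_down_sync and a full-warp mask; this is not the PR's generated code):

// Illustrative sketch: warp-level sum via shuffle instructions, which move
// values between registers directly, needing no shared memory or barriers.
__device__ float warpReduceSum(float val) {
  for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val;  // lane 0 ends up with the warp's total
}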

/*hasElseRegion*/ false);
b.setInsertionPointToStart(&if_tid_valid_op.getThenRegion().front());
SmallVector<Value, 4> yield_values;
for (int stride = 16; stride > 0; stride /= 2) {
Collaborator:

Start with stride = 32.
