Add Op dot #430

Open · wants to merge 6 commits into master

Conversation


@wlxjhyf wlxjhyf commented Jan 21, 2025

PR Category

Operator

Type of Change

Add new operator

Description

Implement the dot operator with support for Float32, Float16, and BFloat16.
The implementation splits the dot operator into two steps: the first performs the elementwise multiplication, and the second performs the summation. A sketch of this approach is given below.
At present, to meet the accuracy requirements, the intermediate results of the first step are stored as float32.
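
A minimal sketch of this two-step approach is shown below. The kernel names dot_kernel_1 and dot_kernel_2 come from the PR's diff, but the kernel bodies, the wrapper, and the block sizes here are illustrative assumptions rather than the PR's exact code:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def dot_kernel_1(x_ptr, y_ptr, mid_ptr, N, BLOCK_SIZE: tl.constexpr):
    # Step 1: each program multiplies one block elementwise and writes a
    # float32 partial sum, preserving accuracy for fp16/bf16 inputs.
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    y = tl.load(y_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    tl.store(mid_ptr + pid, tl.sum(x * y))


@triton.jit
def dot_kernel_2(mid_ptr, out_ptr, MID_SIZE, BLOCK_MID: tl.constexpr):
    # Step 2: a single program sums the float32 partial results.
    offsets = tl.arange(0, BLOCK_MID)
    mask = offsets < MID_SIZE
    mid = tl.load(mid_ptr + offsets, mask=mask, other=0.0)
    tl.store(out_ptr, tl.sum(mid))


def dot(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    N = x.numel()
    block_size = 1024  # assumed block size, not necessarily the PR's value
    mid_size = triton.cdiv(N, block_size)
    block_mid = triton.next_power_of_2(mid_size)
    mid = torch.empty(mid_size, dtype=torch.float32, device=x.device)
    out = torch.empty((), dtype=x.dtype, device=x.device)
    dot_kernel_1[(mid_size,)](x, y, mid, N, block_size)
    dot_kernel_2[(1,)](mid, out, mid_size, block_mid)
    return out
```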

Issue

#394

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT (unit test).

Performance

correctness
[screenshot: correctness test results]

performance
[screenshots: performance benchmark results]

@wlxjhyf wlxjhyf changed the title dot Add op dot Jan 22, 2025
@wlxjhyf wlxjhyf changed the title Add op dot Add Op dot Jan 22, 2025
Collaborator

@StrongSpoon StrongSpoon left a comment

Nice performance. Please resolve the conflicts, and then CI can be permitted.


with torch_device_fn.device(x.device):
    dot_kernel_1[grid_1](x, y, mid, N, block_size)
    dot_kernel_2[grid_2](mid, out, mid_size, block_mid)
Collaborator

I think it's better to take tensor stride into consideration, but it's a good implementation!
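
One simple way to account for strides, as an illustration only (not necessarily what the reviewer had in mind), is to flatten non-contiguous inputs before launching the kernels; `_dot_two_pass` below is a hypothetical helper standing in for the existing two-kernel launch:

```python
import torch


def dot(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # dot is a 1-D reduction, so copying non-contiguous inputs to contiguous
    # buffers lets the kernels keep using simple linear offsets.
    if not x.is_contiguous():
        x = x.contiguous()
    if not y.is_contiguous():
        y = y.contiguous()
    return _dot_two_pass(x, y)  # hypothetical helper wrapping the two kernels
```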

@StrongSpoon StrongSpoon self-assigned this Feb 17, 2025
Comment on lines 73 to 74
dot_kernel_1[grid_1](x, y, mid, N, block_size)
dot_kernel_2[grid_2](mid, out, mid_size, block_mid)
Contributor

Can we resort to a single persistent kernel when the input numel is small enough?
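
For illustration, a one-pass persistent kernel along these lines could look like the sketch below (a single program loops over the whole input; the kernel name and block size are assumptions, not code from this PR):

```python
import triton
import triton.language as tl


@triton.jit
def dot_kernel_one_pass(x_ptr, y_ptr, out_ptr, N, BLOCK_SIZE: tl.constexpr):
    # Launched with a grid of (1,): one persistent program walks the input
    # in BLOCK_SIZE chunks and keeps a float32 accumulator, so no second
    # reduction kernel is needed.
    acc = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)
    for start in range(0, N, BLOCK_SIZE):
        offsets = start + tl.arange(0, BLOCK_SIZE)
        mask = offsets < N
        x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
        y = tl.load(y_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
        acc += x * y
    tl.store(out_ptr, tl.sum(acc))
```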

Collaborator

reasonable.

Author

@wlxjhyf wlxjhyf Feb 24, 2025

I tried to implement dot in a single kernel using atomic_add, but even for very small input numel the performance was not good. I still kept that code in the function dot_kernel.
[screenshot: performance of kernel1 + kernel2]
[screenshot: performance of the single kernel]
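
For context, a single-kernel atomic_add version of dot typically looks roughly like the sketch below (illustrative only, not the dot_kernel actually kept in this PR; the output is assumed to be a zero-initialized float32 scalar):

```python
import triton
import triton.language as tl


@triton.jit
def dot_kernel_atomic(x_ptr, y_ptr, out_ptr, N, BLOCK_SIZE: tl.constexpr):
    # Each program reduces its block to one partial sum and atomically adds
    # it into a single float32 output, so out_ptr must point to a buffer
    # that was zeroed before launch.
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    y = tl.load(y_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    tl.atomic_add(out_ptr, tl.sum(x * y))
```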

Contributor

I probably didn't make myself clear. What I suggested is adding a one-pass branch to handle small inputs; we don't have to use atomic_add on either branch. The two-pass branch still exists.

Author

Understood!

@tongxin
Contributor

tongxin commented Feb 17, 2025

@wlxjhyf, thanks for contributing to FlagGems. Please resolve the conversations and complete this PR at your earliest convenience.

@wlxjhyf
Author

wlxjhyf commented Feb 18, 2025

> @wlxjhyf, thanks for contributing to FlagGems. Please resolve the conversations and complete this PR at your earliest convenience.

I'm sorry, I just saw it. I'll do it right now.

@tongxin
Contributor

tongxin commented Feb 25, 2025

> @wlxjhyf, thanks for contributing to FlagGems. Please resolve the conversations and complete this PR at your earliest convenience.
>
> I'm sorry, I just saw it. I'll do it right now.

Don't be sorry. We are very grateful for your volunteering!

@wlxjhyf
Author

wlxjhyf commented Feb 26, 2025

When using a single kernel, N must be smaller than tl.TRITON_MAX_TENSOR_NUMEL (1048576). In my tests on an A100, I found that when N is smaller than 4096, a single kernel still maintains good performance, so I currently use 4096 as the branching condition (see the sketch below).

[screenshots: performance of the single-kernel branch]
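
A sketch of what that branching could look like, reusing kernels like the ones sketched earlier in this thread (the 4096 threshold comes from this discussion; the wrapper, helper names, and block sizes are assumptions rather than the PR's exact code):

```python
import torch
import triton

ONE_PASS_THRESHOLD = 4096  # threshold chosen empirically on A100 in this PR


def dot(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    N = x.numel()
    out = torch.empty((), dtype=x.dtype, device=x.device)
    if N < ONE_PASS_THRESHOLD:
        # One-pass branch: a single program covers the whole input, so N has
        # to stay below tl.TRITON_MAX_TENSOR_NUMEL (1048576).
        dot_kernel_one_pass[(1,)](x, y, out, N, triton.next_power_of_2(N))
    else:
        # Two-pass branch: block-wise float32 partial sums, then a final sum.
        block_size = 1024
        mid_size = triton.cdiv(N, block_size)
        mid = torch.empty(mid_size, dtype=torch.float32, device=x.device)
        dot_kernel_1[(mid_size,)](x, y, mid, N, block_size)
        dot_kernel_2[(1,)](mid, out, mid_size, triton.next_power_of_2(mid_size))
    return out
```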

I also found that the last submission failed because of the test_index_put_acc_true test. Is this related to my code?
