Add Op dot #430

Open · wants to merge 6 commits into master

Conversation


@wlxjhyf wlxjhyf commented Jan 21, 2025

PR Category

Operator

Type of Change

Add new operator

Description

Implement the dot operator with support for Float32, Float16, and BFloat16.
The implementation splits the dot operator into two steps: the first performs the elementwise multiplication, and the second performs the summation. A sketch of this approach is given below.
At present, to meet the accuracy requirements, the intermediate results of the first step are stored as float32.
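
A minimal sketch of this two-step approach is shown below. The kernel names dot_kernel_1 and dot_kernel_2 come from the PR's diff, but the kernel bodies, the wrapper, and the block sizes here are illustrative assumptions rather than the PR's exact code:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def dot_kernel_1(x_ptr, y_ptr, mid_ptr, N, BLOCK_SIZE: tl.constexpr):
    # Step 1: each program multiplies one block elementwise and writes a
    # float32 partial sum, preserving accuracy for fp16/bf16 inputs.
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    y = tl.load(y_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    tl.store(mid_ptr + pid, tl.sum(x * y))


@triton.jit
def dot_kernel_2(mid_ptr, out_ptr, MID_SIZE, BLOCK_MID: tl.constexpr):
    # Step 2: a single program sums the float32 partial results.
    offsets = tl.arange(0, BLOCK_MID)
    mask = offsets < MID_SIZE
    mid = tl.load(mid_ptr + offsets, mask=mask, other=0.0)
    tl.store(out_ptr, tl.sum(mid))


def dot(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    N = x.numel()
    block_size = 1024  # assumed block size, not necessarily the PR's value
    mid_size = triton.cdiv(N, block_size)
    block_mid = triton.next_power_of_2(mid_size)
    mid = torch.empty(mid_size, dtype=torch.float32, device=x.device)
    out = torch.empty((), dtype=x.dtype, device=x.device)
    dot_kernel_1[(mid_size,)](x, y, mid, N, block_size)
    dot_kernel_2[(1,)](mid, out, mid_size, block_mid)
    return out
```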

Issue

#394

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT (unit test).

Performance

correctness
[screenshot: correctness test results]

performance
[screenshots: performance benchmark results]

@wlxjhyf wlxjhyf changed the title dot Add op dot Jan 22, 2025
@wlxjhyf wlxjhyf changed the title Add op dot Add Op dot Jan 22, 2025
Collaborator

@StrongSpoon StrongSpoon left a comment

Nice performance. Please resolve the conflicts, and then CI can be permitted.


with torch_device_fn.device(x.device):
    dot_kernel_1[grid_1](x, y, mid, N, block_size)
    dot_kernel_2[grid_2](mid, out, mid_size, block_mid)
Collaborator

I think it's better to take tensor stride into consideration, but it's a good implementation!
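
One simple way to account for strides, as an illustration only (not necessarily what the reviewer had in mind), is to flatten non-contiguous inputs before launching the kernels; `_dot_two_pass` below is a hypothetical helper standing in for the existing two-kernel launch:

```python
import torch


def dot(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # dot is a 1-D reduction, so copying non-contiguous inputs to contiguous
    # buffers lets the kernels keep using simple linear offsets.
    if not x.is_contiguous():
        x = x.contiguous()
    if not y.is_contiguous():
        y = y.contiguous()
    return _dot_two_pass(x, y)  # hypothetical helper wrapping the two kernels
```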

@StrongSpoon StrongSpoon self-assigned this Feb 17, 2025
Comment on lines 73 to 74
dot_kernel_1[grid_1](x, y, mid, N, block_size)
dot_kernel_2[grid_2](mid, out, mid_size, block_mid)
Contributor

Can we resort to a single persistent kernel when the input numel is small enough?
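
For illustration, a one-pass persistent kernel along these lines could look like the sketch below (a single program loops over the whole input; the kernel name and block size are assumptions, not code from this PR):

```python
import triton
import triton.language as tl


@triton.jit
def dot_kernel_one_pass(x_ptr, y_ptr, out_ptr, N, BLOCK_SIZE: tl.constexpr):
    # Launched with a grid of (1,): one persistent program walks the input
    # in BLOCK_SIZE chunks and keeps a float32 accumulator, so no second
    # reduction kernel is needed.
    acc = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)
    for start in range(0, N, BLOCK_SIZE):
        offsets = start + tl.arange(0, BLOCK_SIZE)
        mask = offsets < N
        x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
        y = tl.load(y_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
        acc += x * y
    tl.store(out_ptr, tl.sum(acc))
```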

Collaborator

reasonable.

Author

@wlxjhyf wlxjhyf Feb 24, 2025

I tried to implement dot in a single kernel using atomic_add, but even for very small input numel the performance was not good. I still kept that code in the function dot_kernel.
[screenshot: performance of kernel1 + kernel2]
[screenshot: performance of the single kernel]
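
For context, a single-kernel atomic_add version of dot typically looks roughly like the sketch below (illustrative only, not the dot_kernel actually kept in this PR; the output is assumed to be a zero-initialized float32 scalar):

```python
import triton
import triton.language as tl


@triton.jit
def dot_kernel_atomic(x_ptr, y_ptr, out_ptr, N, BLOCK_SIZE: tl.constexpr):
    # Each program reduces its block to one partial sum and atomically adds
    # it into a single float32 output, so out_ptr must point to a buffer
    # that was zeroed before launch.
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    y = tl.load(y_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    tl.atomic_add(out_ptr, tl.sum(x * y))
```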

Contributor

I probably didn't make myself clear. What I suggested is adding a one-pass branch to handle small inputs; we don't have to use atomic_add on either branch. The two-pass branch still exists.

Author

Understood!

@tongxin
Contributor

tongxin commented Feb 17, 2025

@wlxjhyf, thanks for contributing to FlagGems. Please resolve the conversations and complete this PR at your earliest convenience.

@wlxjhyf
Author

wlxjhyf commented Feb 18, 2025

> @wlxjhyf, thanks for contributing to FlagGems. Please resolve the conversations and complete this PR at your earliest convenience.

I'm sorry, I just saw it. I'll do it right now.

@tongxin
Contributor

tongxin commented Feb 25, 2025

> @wlxjhyf, thanks for contributing to FlagGems. Please resolve the conversations and complete this PR at your earliest convenience.
>
> I'm sorry, I just saw it. I'll do it right now.

Don't be sorry. We are very grateful for your volunteering!

@wlxjhyf
Author

wlxjhyf commented Feb 26, 2025

When using a single kernel, N must be smaller than tl.TRITON_MAX_TENSOR_NUMEL (1048576). In my tests on an A100, I found that when N is smaller than 4096, a single kernel still maintains good performance, so I currently use 4096 as the branching condition (see the sketch below).

[screenshots: performance of the single-kernel branch]
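
A sketch of what that branching could look like, reusing kernels like the ones sketched earlier in this thread (the 4096 threshold comes from this discussion; the wrapper, helper names, and block sizes are assumptions rather than the PR's exact code):

```python
import torch
import triton

ONE_PASS_THRESHOLD = 4096  # threshold chosen empirically on A100 in this PR


def dot(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    N = x.numel()
    out = torch.empty((), dtype=x.dtype, device=x.device)
    if N < ONE_PASS_THRESHOLD:
        # One-pass branch: a single program covers the whole input, so N has
        # to stay below tl.TRITON_MAX_TENSOR_NUMEL (1048576).
        dot_kernel_one_pass[(1,)](x, y, out, N, triton.next_power_of_2(N))
    else:
        # Two-pass branch: block-wise float32 partial sums, then a final sum.
        block_size = 1024
        mid_size = triton.cdiv(N, block_size)
        mid = torch.empty(mid_size, dtype=torch.float32, device=x.device)
        dot_kernel_1[(mid_size,)](x, y, mid, N, block_size)
        dot_kernel_2[(1,)](mid, out, mid_size, triton.next_power_of_2(mid_size))
    return out
```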

I also found that the last submission failed because of the test_index_put_acc_true test. Is this related to my code?
