Add Op dot #430
base: master
Conversation
Nice performance. Please resolve the conflicts, and then CI can be approved to run.
src/flag_gems/ops/dot.py (outdated)
with torch_device_fn.device(x.device):
    dot_kernel_1[grid_1](x, y, mid, N, block_size)
    dot_kernel_2[grid_2](mid, out, mid_size, block_mid)
I think it's better to take tensor stride into consideration, but it's a good implementation!
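For illustration, strides could be handled by passing each input's element stride into the first kernel and scaling the offsets; the kernel name, signature, and the per-block partial sum below are assumptions for this sketch, not the PR's actual code:

```python
import triton
import triton.language as tl


@triton.jit
def dot_kernel_1_strided(x_ptr, y_ptr, mid_ptr, N, x_stride, y_stride,
                         BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N
    # Scale the logical offsets by each input's element stride so that
    # non-contiguous 1-D tensors are read correctly.
    x = tl.load(x_ptr + offsets * x_stride, mask=mask, other=0.0).to(tl.float32)
    y = tl.load(y_ptr + offsets * y_stride, mask=mask, other=0.0).to(tl.float32)
    tl.store(mid_ptr + pid, tl.sum(x * y, axis=0))
```

The wrapper would then pass `x.stride(0)` and `y.stride(0)` at launch; calling `.contiguous()` on the inputs instead is a simpler alternative at the cost of a possible copy.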
src/flag_gems/ops/dot.py (outdated)
    dot_kernel_1[grid_1](x, y, mid, N, block_size)
    dot_kernel_2[grid_2](mid, out, mid_size, block_mid)
Can we resort to a single persistent kernel when the input numel is small enough?
Reasonable.
I probably didn't make myself clear. What I suggested is adding a one-pass branch to handle small inputs. We don't have to use atomic_add on either branch; the two-pass branch still exists.
Understood!
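For concreteness, the suggested one-pass branch could look roughly like the sketch below; the threshold value, kernel name, and wrapper signature are assumptions, and the two-pass branch from the PR is left untouched:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def dot_kernel_one_pass(x_ptr, y_ptr, out_ptr, N, BLOCK_SIZE: tl.constexpr):
    # A single persistent program loops over the whole input with a float32
    # accumulator, so neither atomic_add nor an intermediate buffer is needed.
    acc = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)
    for start in range(0, N, BLOCK_SIZE):
        offsets = start + tl.arange(0, BLOCK_SIZE)
        mask = offsets < N
        x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
        y = tl.load(y_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
        acc += x * y
    tl.store(out_ptr, tl.sum(acc, axis=0))


def dot(x, y, one_pass_threshold=2**20):  # threshold is a placeholder value
    N = x.numel()
    out = torch.empty((), dtype=x.dtype, device=x.device)
    if N <= one_pass_threshold:
        # Small input: one program does multiply + reduce in a single pass.
        dot_kernel_one_pass[(1,)](x, y, out, N, BLOCK_SIZE=1024)
    else:
        # Large input: keep the existing two-pass (partial sums + reduce) path.
        ...
    return out
```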
@wlxjhyf, thanks for contributing to FlagGems. Please resolve the conversations and complete this PR at your earliest convenience.
I'm sorry, I just saw it. I'll do it right now.
Don't be sorry. We are very grateful for your volunteering!
PR Category
Operator
Type of Change
Add new operator
Description
Implement the dot operator, supporting Float32, Float16, and BFloat16.
The implementation splits the dot operator into two steps: the first kernel performs the elementwise multiplication, and the second performs the summation.
For now, to meet accuracy requirements, the intermediate results of the first step are stored as float32.
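A minimal sketch of this two-step structure is shown below, assuming the first kernel also folds each block into a float32 partial sum; the kernel names mirror the launch code quoted in the review above, but the bodies and the wrapper are a reconstruction, not the exact code in this PR:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def dot_kernel_1(x_ptr, y_ptr, mid_ptr, N, BLOCK_SIZE: tl.constexpr):
    # Step 1: each program multiplies one block elementwise and writes a
    # float32 partial sum to `mid`, preserving accuracy for fp16/bf16 inputs.
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    y = tl.load(y_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    tl.store(mid_ptr + pid, tl.sum(x * y, axis=0))


@triton.jit
def dot_kernel_2(mid_ptr, out_ptr, mid_size, BLOCK_MID: tl.constexpr):
    # Step 2: a single program sums the float32 partial results into `out`.
    offsets = tl.arange(0, BLOCK_MID)
    mask = offsets < mid_size
    mid = tl.load(mid_ptr + offsets, mask=mask, other=0.0)
    tl.store(out_ptr, tl.sum(mid, axis=0))


def dot(x, y):
    N = x.numel()
    block_size = 1024
    mid_size = triton.cdiv(N, block_size)
    block_mid = triton.next_power_of_2(mid_size)
    mid = torch.empty(mid_size, dtype=torch.float32, device=x.device)
    out = torch.empty((), dtype=x.dtype, device=x.device)
    dot_kernel_1[(mid_size,)](x, y, mid, N, BLOCK_SIZE=block_size)
    dot_kernel_2[(1,)](mid, out, mid_size, BLOCK_MID=block_mid)
    return out
```

The cast back to the input dtype happens implicitly when the float32 sum is stored into `out`.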
Issue
#394
Progress
Performance
correctness
(screenshot of correctness test results attached in the PR)
performance
(screenshots of performance benchmark results attached in the PR)