
[Question]: what is the speedup of the attention kernel in the current implementation? #73

foreverpiano opened this issue Sep 10, 2024 · 1 comment

@foreverpiano commented Sep 10, 2024
Describe the issue

[image]
The pattern looks good, but I wonder whether there is a hardware-efficient kernel for these patterns. Have you tested the speedup of this sparse SDPA attention kernel compared to the original causal attention?

@foreverpiano foreverpiano added the question Further information is requested label Sep 10, 2024
@iofu728 iofu728 self-assigned this Sep 11, 2024
@iofu728 (Contributor) commented Sep 11, 2024

Hi @foreverpiano, thank you for your interest in MInference.

We have released the GPU kernel in the library. You can follow the startup guide to use it.

We have also presented end-to-end speedup and micro-benchmark results in https://github.com/microsoft/MInference/tree/main/experiments#minference-benchmark-experiments and Appendix D.2.
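For intuition on where the speedup comes from, here is a back-of-envelope FLOP comparison between dense causal attention and a block-sparse pattern. This is an illustrative sketch only: the block size, kept-block fraction, and sequence length below are made-up assumptions, not MInference's measured numbers (those are in the linked benchmarks).

```python
# Illustrative sketch: theoretical FLOP reduction of a block-sparse
# attention pattern vs. dense causal attention. All parameters here
# (seq length, block size, kept_fraction) are hypothetical examples.

def causal_attention_flops(seq_len: int, head_dim: int) -> int:
    """Dense causal attention scores ~seq_len*(seq_len+1)/2 query-key
    pairs; each pair costs 2*head_dim FLOPs for QK^T and another
    2*head_dim for the PV product."""
    pairs = seq_len * (seq_len + 1) // 2
    return pairs * 4 * head_dim


def block_sparse_attention_flops(seq_len: int, head_dim: int,
                                 block: int, kept_fraction: float) -> int:
    """Block-sparse attention computes only kept_fraction of the causal
    blocks; each kept block costs block^2 * 4 * head_dim FLOPs (diagonal
    blocks are counted as full blocks, a slight overestimate)."""
    blocks_per_side = seq_len // block
    n_causal_blocks = blocks_per_side * (blocks_per_side + 1) // 2
    kept = int(n_causal_blocks * kept_fraction)
    return kept * block * block * 4 * head_dim


if __name__ == "__main__":
    seq, dim = 131072, 128          # hypothetical long-context setting
    dense = causal_attention_flops(seq, dim)
    sparse = block_sparse_attention_flops(seq, dim, block=64,
                                          kept_fraction=0.05)
    print(f"theoretical speedup ~{dense / sparse:.1f}x")
```

Note that real kernel speedup is lower than the FLOP ratio suggests, since sparse kernels pay overhead for index lookup and irregular memory access; the end-to-end numbers in the benchmark link reflect that.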
