
[Question]: what is the speedup of the attention kernel in the current implementation? #73

foreverpiano opened this issue Sep 10, 2024 · 1 comment

@foreverpiano commented Sep 10, 2024
Describe the issue

[image]
The pattern looks good, but I wonder whether there is a hardware-efficient kernel for these patterns. Have you tested the speedup of this sparse SDPA attention kernel compared to the original causal attention?

@foreverpiano foreverpiano added the question Further information is requested label Sep 10, 2024
@iofu728 iofu728 self-assigned this Sep 11, 2024
@iofu728 (Contributor) commented Sep 11, 2024

Hi @foreverpiano, thank you for your interest in MInference.

We have released the GPU kernel in the library. You can follow the startup guide to use it.

We have also presented end-to-end speedup and micro-benchmark results in https://github.com/microsoft/MInference/tree/main/experiments#minference-benchmark-experiments and Appendix D.2.
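For intuition on where the speedup comes from, here is a back-of-envelope FLOP comparison between dense causal attention and a block-sparse pattern. This is an illustrative sketch only: the block size, kept-block fraction, and sequence length below are made-up assumptions, not MInference's measured numbers (those are in the linked benchmarks).

```python
# Illustrative sketch: theoretical FLOP reduction of a block-sparse
# attention pattern vs. dense causal attention. All parameters here
# (seq length, block size, kept_fraction) are hypothetical examples.

def causal_attention_flops(seq_len: int, head_dim: int) -> int:
    """Dense causal attention scores ~seq_len*(seq_len+1)/2 query-key
    pairs; each pair costs 2*head_dim FLOPs for QK^T and another
    2*head_dim for the PV product."""
    pairs = seq_len * (seq_len + 1) // 2
    return pairs * 4 * head_dim


def block_sparse_attention_flops(seq_len: int, head_dim: int,
                                 block: int, kept_fraction: float) -> int:
    """Block-sparse attention computes only kept_fraction of the causal
    blocks; each kept block costs block^2 * 4 * head_dim FLOPs (diagonal
    blocks are counted as full blocks, a slight overestimate)."""
    blocks_per_side = seq_len // block
    n_causal_blocks = blocks_per_side * (blocks_per_side + 1) // 2
    kept = int(n_causal_blocks * kept_fraction)
    return kept * block * block * 4 * head_dim


if __name__ == "__main__":
    seq, dim = 131072, 128          # hypothetical long-context setting
    dense = causal_attention_flops(seq, dim)
    sparse = block_sparse_attention_flops(seq, dim, block=64,
                                          kept_fraction=0.05)
    print(f"theoretical speedup ~{dense / sparse:.1f}x")
```

Note that real kernel speedup is lower than the FLOP ratio suggests, since sparse kernels pay overhead for index lookup and irregular memory access; the end-to-end numbers in the benchmark link reflect that.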
