Flash attention #18270
pfeatherstone asked this question in General (unanswered)

Does ONNX Runtime use flash attention?

I noticed that the contrib operators include CPU and CUDA implementations of memory-efficient attention. Are they used generally by the CPU and CUDA execution providers, or are they specific to BERT?

For example, does PyTorch's scaled_dot_product_attention() get ONNX-exported to an efficient kernel, or does it get unfolded into a series of MatMul operations? Thank you.
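You can check the second question empirically: export a small module that calls scaled_dot_product_attention() and list the op types that end up in the graph. A minimal sketch (the file name, shapes, and opset below are arbitrary choices, not anything mandated by the exporter):

```python
import torch
import torch.nn.functional as F
import onnx

class SDPA(torch.nn.Module):
    def forward(self, q, k, v):
        # At runtime PyTorch can dispatch this to flash / memory-efficient
        # kernels; the ONNX exporter decides separately how to serialize it.
        return F.scaled_dot_product_attention(q, k, v)

# Arbitrary shapes: (batch, num_heads, seq_len, head_dim)
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))

torch.onnx.export(SDPA(), (q, k, v), "sdpa.onnx", opset_version=17,
                  input_names=["q", "k", "v"], output_names=["out"])

# If this prints MatMul/Softmax/etc. rather than a single fused attention
# op, the exporter decomposed the call, and any efficient-kernel execution
# would have to come from the runtime fusing it back together.
model = onnx.load("sdpa.onnx")
print(sorted({node.op_type for node in model.graph.node}))
```

At least up through opset 17 the standard ONNX opset has no dedicated attention operator, so the exporter emits the decomposed subgraph; whether it runs as an efficient kernel then depends on the runtime.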
Replies: 1 comment

I'm also paying attention to this problem. Flash attention doesn't seem to be handled well, which leads to out-of-memory problems during ONNX Runtime inference.
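For what it's worth, one thing to try when memory is the bottleneck is ONNX Runtime's transformer optimization tooling, which can fuse a decomposed attention subgraph into the contrib attention op that the memory-efficient CPU/CUDA implementations back. A sketch, assuming a BERT-like model; the file names are placeholders, and num_heads/hidden_size must match the exported model:

```python
from onnxruntime.transformers import optimizer

# "model.onnx" / "model_opt.onnx" are placeholder paths.
# Fusion patterns are selected per architecture via model_type
# (e.g. "bert", "gpt2"), which is part of why the contrib attention
# ops can look BERT-specific: the fusion logic is per-model even
# though the resulting kernels are generic.
opt = optimizer.optimize_model(
    "model.onnx",
    model_type="bert",
    num_heads=8,      # must match the model
    hidden_size=512,  # must match the model
)
opt.save_model_to_file("model_opt.onnx")
```

Whether the fused op then dispatches to a flash-style kernel depends on the execution provider, hardware, and build.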