Flash attention #18270
pfeatherstone asked this question in General (unanswered)

Does ONNX Runtime use flash attention?

I noticed that the contrib operators include CPU and CUDA implementations of memory-efficient attention. Are they used generally by the CPU and CUDA execution providers, or are they specific to BERT?

For example, does PyTorch's scaled_dot_product_attention() get ONNX-exported to an efficient kernel, or does it get unfolded into a series of MatMul operations? Thank you.
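You can check the second question empirically: export a small module that calls scaled_dot_product_attention() and list the op types that end up in the graph. A minimal sketch (the file name, shapes, and opset below are arbitrary choices, not anything mandated by the exporter):

```python
import torch
import torch.nn.functional as F
import onnx

class SDPA(torch.nn.Module):
    def forward(self, q, k, v):
        # At runtime PyTorch can dispatch this to flash / memory-efficient
        # kernels; the ONNX exporter decides separately how to serialize it.
        return F.scaled_dot_product_attention(q, k, v)

# Arbitrary shapes: (batch, num_heads, seq_len, head_dim)
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))

torch.onnx.export(SDPA(), (q, k, v), "sdpa.onnx", opset_version=17,
                  input_names=["q", "k", "v"], output_names=["out"])

# If this prints MatMul/Softmax/etc. rather than a single fused attention
# op, the exporter decomposed the call, and any efficient-kernel execution
# would have to come from the runtime fusing it back together.
model = onnx.load("sdpa.onnx")
print(sorted({node.op_type for node in model.graph.node}))
```

At least up through opset 17 the standard ONNX opset has no dedicated attention operator, so the exporter emits the decomposed subgraph; whether it runs as an efficient kernel then depends on the runtime.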
Replies: 1 comment

I'm also paying attention to this problem. Flash attention doesn't seem to be handled well, which leads to out-of-memory problems during ONNX Runtime inference.
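For what it's worth, one thing to try when memory is the bottleneck is ONNX Runtime's transformer optimization tooling, which can fuse a decomposed attention subgraph into the contrib attention op that the memory-efficient CPU/CUDA implementations back. A sketch, assuming a BERT-like model; the file names are placeholders, and num_heads/hidden_size must match the exported model:

```python
from onnxruntime.transformers import optimizer

# "model.onnx" / "model_opt.onnx" are placeholder paths.
# Fusion patterns are selected per architecture via model_type
# (e.g. "bert", "gpt2"), which is part of why the contrib attention
# ops can look BERT-specific: the fusion logic is per-model even
# though the resulting kernels are generic.
opt = optimizer.optimize_model(
    "model.onnx",
    model_type="bert",
    num_heads=8,      # must match the model
    hidden_size=512,  # must match the model
)
opt.save_model_to_file("model_opt.onnx")
```

Whether the fused op then dispatches to a flash-style kernel depends on the execution provider, hardware, and build.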