[WIP]: Initial DynamicFlexAttention wrapper class for dynamic sequence lengths #1960

Open
wants to merge 1 commit into base: main

Conversation

zyklotomic

Had a stab at making Flex Attention work without excessive recompilation. I am not fully confident in this approach; it feels pretty janky, so I wanted confirmation that this is the right direction.

In essence, the kernel has to recompile every time the input sizes change. So why not compile a kernel for a larger fixed size, pad inputs up to it when necessary, and slice the result back down before returning? See the code for more thorough comments.
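
To make the idea concrete, here is a minimal sketch of the pad-then-slice approach. The names, the 128 bucket, and the masking details are illustrative assumptions rather than the actual wrapper class in this PR, and it assumes PyTorch >= 2.5 with FlexAttention available:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

BUCKET = 128  # pad sequence lengths up to the next multiple of this size

# Compile once with dynamic=False: the kernel only ever sees bucketed shapes,
# so at most one compile happens per bucket size rather than per sequence length.
compiled_flex_attention = torch.compile(flex_attention, dynamic=False)

def padded_flex_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    seq_len = q.shape[-2]
    padded_len = ((seq_len + BUCKET - 1) // BUCKET) * BUCKET
    pad = padded_len - seq_len
    if pad:
        # Zero-pad the sequence dimension on the right.
        q = torch.nn.functional.pad(q, (0, 0, 0, pad))
        k = torch.nn.functional.pad(k, (0, 0, 0, pad))
        v = torch.nn.functional.pad(v, (0, 0, 0, pad))

    seq_len_t = torch.tensor(seq_len, device=q.device)

    def pad_mask(b, h, q_idx, kv_idx):
        # Padded key positions must not receive attention weight.
        return kv_idx < seq_len_t

    block_mask = create_block_mask(pad_mask, B=None, H=None,
                                   Q_LEN=padded_len, KV_LEN=padded_len,
                                   device=q.device)
    out = compiled_flex_attention(q, k, v, block_mask=block_mask)
    # Slice the padding back off before returning.
    return out[..., :seq_len, :]
```

A real wrapper would also need to compose this padding mask with whatever score_mod or block_mask the caller passes in, which is where most of the fiddly details live.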

I haven't had the chance to really test the performance yet. There are potential enhancements too that I mention in the comments.

Will attach testing code for a demo in a bit.


@danielhanchen
Contributor

Hey! Great PR! Do you know why dynamic = True fails? Hmm, padding to 128 will also require the inputs, i.e. the data collator, to pad to 128.
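
(For reference, one way to make the data side line up with a 128 bucket is the collator's pad_to_multiple_of argument. This sketch assumes the Hugging Face DataCollatorWithPadding and a GPT-2 tokenizer purely for illustration:)

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
collator = DataCollatorWithPadding(tokenizer, padding=True, pad_to_multiple_of=128)

batch = collator([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}])
print(batch["input_ids"].shape)  # torch.Size([2, 128]): both rows padded up to the bucket
```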

@zyklotomic
Author

I have some interesting findings to report back! I should have dug deeper initially. It turns out that dynamic-shape support is something that has been worked on, and it is apparently available in the nightly version of PyTorch.

Links of interest:
pytorch/pytorch#135206
pytorch/pytorch#147756
tolleybot/pytorch@4a57fd0

https://github.com/pytorch/pytorch/blob/8d08b4901586f230353a558ee00c16ad57f95178/torch/_inductor/kernel/flex_attention.py#L705 (most recent commit as of writing) -> which points to https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/flex_decoding.py#L336

  • Something interesting to note about the above is the check that the query length is under 128, as seen in sympy.Lt(query.get_size()[-2], 128), before the dynamic kernel is used. So maybe there is a case for the approach in this PR after all for larger sequence lengths? It does line up with my understanding that dynamic=True is slow and not worth it for large sizes.

I did try my example notebook with dynamic=True on the nightly build and ended up with a different failure, though. Granted, it is a nightly release after all. With dynamic=False, I saw that a lot of auto-tuning logic has since been added.
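
(For context, "set dynamic=True" above means compiling FlexAttention with symbolic shapes, roughly like the sketch below; the exact notebook isn't reproduced here:)

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Ask torch.compile to trace with symbolic sequence lengths instead of
# specializing (and recompiling) on each concrete shape.
flex_dyn = torch.compile(flex_attention, dynamic=True)

q = torch.randn(1, 8, 517, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 517, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 517, 64, device="cuda", dtype=torch.float16)

# Per the discussion above, this path errors on the stable release and fails
# differently on the nightly build.
out = flex_dyn(q, k, v)
```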

I'm not a torch.compile expert; a lot of this is new to me, and I definitely did not understand 100% of the links I shared, lol.

As for your question on why dynamic=True failed, I suspect at least one reason is that the shapes just didn't match up with the hard-coded candidate block sizes (https://github.com/pytorch/pytorch/blob/v2.5.1/torch/_inductor/kernel/flex_attention.py#L822)? A lot of speculation here.

What do you think is the best course of action? Should we wait for the PyTorch folks to stabilize dynamic-shape support instead?
