A faster flash attention bwd implementation #177
base: main
Conversation
- Decompose the bwd kernel into two kernels, one for dq and one for dk/dv.
- Extra parallelism over the sequence-length axis (see the grid sketch below).
- On a benchmark, it is 4x faster than the previous implementation and 2x faster than the XLA bwd pass.
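For intuition, here is a rough sketch of where the extra parallelism comes from; the shapes and block size below are illustrative, not the PR's actual configuration:

```python
# Illustrative only: how many programs each backward variant can launch,
# using made-up shapes and a made-up block size.
batch, num_heads, seq_len, head_dim = 2, 4, 2048, 64
block = 128

fused_grid = (batch, num_heads)                    # previous bwd: one program per (batch, head)
split_grid = (batch, num_heads, seq_len // block)  # this PR: also parallel over sequence blocks

print(fused_grid)  # (2, 4)     -> 8 programs
print(split_grid)  # (2, 4, 16) -> 128 programs
```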
High-level comment: the current backward pass is a fully fused kernel that parallelizes over batch * num_heads threads.
For attention shapes with a small batch and few heads (as is common in language model training), this kernel will underutilize the GPU.
However, there are applications where the fused kernel is faster than the two-kernel variant.
Could you add the two-kernel version as a separate backward pass impl, so the user can select the one they want?
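A minimal sketch of that kind of selection, using hypothetical names `_mha_backward_fused` and `_mha_backward_split` as stand-ins for the two implementations; the PR itself keys off a static `backward_pass_impl` string, as the diff below shows:

```python
# Sketch only: dispatch between the fused and split backward passes via a
# static string. The placeholder functions stand in for the real kernels.
def _mha_backward_fused(*residuals):
  raise NotImplementedError  # placeholder for the existing fused kernel

def _mha_backward_split(*residuals):
  raise NotImplementedError  # placeholder for the new two-kernel version

_BWD_IMPLS = {
    "triton": _mha_backward_fused,        # one fused kernel over (batch, heads)
    "triton_split": _mha_backward_split,  # dq kernel + dk/dv kernel, extra seq-length axis
}

def select_bwd(backward_pass_impl: str):
  if backward_pass_impl not in _BWD_IMPLS:
    raise ValueError(f"unknown backward_pass_impl: {backward_pass_impl}")
  return _BWD_IMPLS[backward_pass_impl]
```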
Could you also add tests into pallas_test.py?
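A hedged sketch of what such a test could check, assuming an `mha` entry point that accepts a `backward_pass_impl` keyword; the exact signature, tensor layout, and tolerances here are assumptions:

```python
import jax
import jax.numpy as jnp

def check_backward_impls_agree(mha, q, k, v):
  # Compare the gradients produced by the fused and split backward passes.
  dout = jnp.ones(q.shape, q.dtype)  # output has the same shape as q
  _, vjp_fused = jax.vjp(lambda *xs: mha(*xs, backward_pass_impl="triton"), q, k, v)
  _, vjp_split = jax.vjp(lambda *xs: mha(*xs, backward_pass_impl="triton_split"), q, k, v)
  for g_fused, g_split in zip(vjp_fused(dout), vjp_split(dout)):
    assert jnp.allclose(g_fused, g_split, atol=1e-2, rtol=1e-2)
```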
pl.BlockSpec(lambda _, j, k: (j, 0, k, 0), (None, seq_len, None, head_dim)),
pl.BlockSpec(lambda _, j, k: (j, 0, k, 0), (None, seq_len, None, head_dim)),
pl.BlockSpec(lambda _, j, k: (j, k, 0), (None, None, seq_len)),
pl.BlockSpec(lambda j, k, _: (j, 0, k, 0), (None, seq_len, None, head_dim)),
Can you rename to be (i, j, _)? Same below?
upper_bound = jt.cdiv(seq_len, block_k)
dq = lax.fori_loop(0, upper_bound, inner_loop, dq)
pl.store(dq_ref, (pl.ds(start_q * block_q, block_q),
slice(None)), dq, eviction_policy="evict_last")
I don't think we need eviction policy here
slice(None)), dv.astype(dv_ref.dtype))
pl.store(dk_ref, (pl.ds(start_k * block_k, block_k),
slice(None)), dk.astype(dk_ref.dtype))
Nit: indentation
@@ -346,6 +450,65 @@ def _mha_backward(sm_scale: float, causal: bool, block_q: int, block_k: int,
num_warps=num_warps,
num_stages=1,
input_output_aliases={8: 0})(q, k, v, out, do_scaled, l, m, delta, dq)
elif backward_pass_impl == "triton_split":
# We accumulate into dq so we need to initialize it to zeros.
Comment is not accurate here
elif backward_pass_impl == "triton_split":
# We accumulate into dq so we need to initialize it to zeros.
out_shapes_q = jax.ShapeDtypeStruct(q.shape, jnp.float32)
I suspect we don't need dq to be f32 anymore. Could you try q.dtype?
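The change being suggested, as a small sketch; the shape and dtype below are illustrative, not the PR's actual inputs:

```python
import jax
import jax.numpy as jnp

q = jnp.zeros((2, 2048, 4, 64), dtype=jnp.bfloat16)  # illustrative query tensor

# Allocate dq in the input dtype instead of forcing float32; whether the
# accumulation still has enough precision is what needs to be verified.
out_shapes_q = jax.ShapeDtypeStruct(q.shape, q.dtype)
```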
@sharadmv Can this PR be merged? We see a big performance improvement on NVIDIA A100 GPUs with this PR.
I left some comments. @tonywu95 do you have time to address them?
Hey @tonywu95, is it ok if we take over this PR and put you as a co-author? We'd love to get it in!