
Commit da11d1b

Bump v2.6.0
1 parent d0787ac · commit da11d1b


3 files changed: 8 additions, 3 deletions


README.md

Lines changed: 5 additions & 0 deletions
@@ -314,6 +314,11 @@ Implement deterministic backward pass. Thanks to engineers from [Meituan](www.me
 Support paged KV cache (i.e., [PagedAttention](https://arxiv.org/abs/2309.06180)).
 Thanks to @beginlner for this contribution.
 
+### 2.6: Softcapping.
+
+Support attention with softcapping, as used in Gemma-2 and Grok models.
+Thanks to @Narsil for this contribution.
+
 ## Performance
 
 We present expected speedup (combined forward + backward pass) and memory savings from using FlashAttention against PyTorch standard attention, depending on sequence length, on different GPUs (speedup depends on memory bandwidth - we see more speedup on slower GPU memory).
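For context, softcapping bounds the pre-softmax attention scores with a scaled tanh, as used in Gemma-2 and Grok. A naive PyTorch sketch of the computation (illustrative only, not the fused CUDA kernel; the cap value and helper names are hypothetical):

```python
import torch

def softcap(scores: torch.Tensor, cap: float) -> torch.Tensor:
    # Logit softcapping: smoothly bound scores to (-cap, cap).
    return cap * torch.tanh(scores / cap)

def attention_with_softcap(q, k, v, cap: float = 30.0):
    # q, k, v: (batch, nheads, seqlen, headdim); naive reference attention.
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    scores = softcap(scores, cap)  # the behavior added in this release
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", probs, v)
```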

flash_attn/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-__version__ = "2.5.9.post1"
+__version__ = "2.6.0"
 
 from flash_attn.flash_attn_interface import (
     flash_attn_func,
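The new feature is reached through the existing Python entry points re-exported here. A minimal usage sketch, assuming a `softcap` keyword argument is exposed on `flash_attn_func` in this release (with `0.0` keeping the previous behavior):

```python
import torch
from flash_attn import flash_attn_func

# (batch, seqlen, nheads, headdim) half-precision tensors on the GPU.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# softcap > 0.0 applies tanh softcapping to the attention scores (assumed keyword).
out = flash_attn_func(q, k, v, causal=True, softcap=30.0)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```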

training/Dockerfile

Lines changed: 2 additions & 2 deletions
@@ -85,7 +85,7 @@ RUN pip install transformers==4.25.1 datasets==2.8.0 pytorch-lightning==1.8.6 tr
 RUN pip install git+https://github.com/mlcommons/[email protected]
 
 # Install FlashAttention
-RUN pip install flash-attn==2.5.9.post1
+RUN pip install flash-attn==2.6.0
 
 # Install CUDA extensions for fused dense
-RUN pip install git+https://github.com/HazyResearch/flash-attention@v2.5.9.post1#subdirectory=csrc/fused_dense_lib
+RUN pip install git+https://github.com/HazyResearch/flash-attention@v2.6.0#subdirectory=csrc/fused_dense_lib
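A quick sanity check after rebuilding the image, to confirm the pinned version is what actually got installed (a hypothetical verification step, not part of the Dockerfile):

```python
import flash_attn
print(flash_attn.__version__)  # expected: 2.6.0

# The core kernel entry point should import cleanly as well.
from flash_attn import flash_attn_func
```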
