Sourced from torch's releases.
PyTorch 2.5.0 Release, SDPA CuDNN backend, Flex Attention
PyTorch 2.5 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
- Security
Highlights
We are excited to announce the release of PyTorch® 2.5! This release features a new CuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100s or newer GPUs. In addition, regional compilation of torch.compile offers a way to reduce the cold start time of torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in an LLM) without recompilations. Finally, the TorchInductor CPP backend offers solid performance speedups with numerous enhancements like FP16 support, the CPP wrapper, AOT-Inductor mode, and max-autotune mode.

This release is composed of 4095 commits from 504 contributors since PyTorch 2.4. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.5. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page. Please also check out the new releases of our ecosystem projects TorchRec and TorchFix.
| Beta | Prototype |
| --- | --- |
| CuDNN backend for SDPA | FlexAttention |
| torch.compile regional compilation without recompilations | Compiled Autograd |
| TorchDynamo added support for exception handling & MutableMapping types | Flight Recorder |
| TorchInductor CPU backend optimization | Max-autotune Support on CPU with GEMM Template |
| | TorchInductor on Windows |
| | FP16 support on CPU path for both eager mode and TorchInductor CPP backend |
| | Autoload Device Extension |
| | Enhanced Intel GPU support |

*To see a full list of public feature submissions click here.
BETA FEATURES
[Beta] CuDNN backend for SDPA
The cuDNN "Fused Flash Attention" backend was landed for torch.nn.functional.scaled_dot_product_attention. On NVIDIA H100 GPUs this can provide up to 75% speed-up over FlashAttentionV2. This speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
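Below is a minimal sketch of calling SDPA while explicitly opting into the cuDNN backend, which can be useful for benchmarking or verification. The shapes and dtype are illustrative and an H100-class GPU is assumed; in normal use no opt-in is needed, since the backend is selected automatically.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative (batch, heads, seq_len, head_dim) tensors in half precision,
# which the fused attention backends require.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict backend selection to cuDNN attention for this call.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```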
[Beta] torch.compile regional compilation without recompilations
Regional compilation without recompilations is available via torch._dynamo.config.inline_inbuilt_nn_modules, which defaults to True in 2.5+. This option allows users to compile a repeated nn.Module (e.g. a transformer layer in an LLM) without recompilations. Compared to compiling the full model, this option can result in smaller compilation latencies at the cost of a 1%-5% performance degradation. See the tutorial for more information.
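As a rough illustration (the toy Block/Model classes below are made up for this example), the sketch compiles only the repeated block rather than the whole model, so the compiled artifact is reused across all block instances:

```python
import torch
import torch.nn as nn

# Toy repeated block standing in for e.g. a transformer layer.
class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

class Model(nn.Module):
    def __init__(self, dim, depth):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

model = Model(dim=128, depth=12)

# inline_inbuilt_nn_modules (default True in 2.5+) lets the compiled region be
# reused across instances of the same nn.Module without recompiling.
torch._dynamo.config.inline_inbuilt_nn_modules = True

# Regional compilation: compile each repeated block in place instead of the
# whole model, trading a little performance for a much smaller cold start.
for block in model.blocks:
    block.compile()

out = model(torch.randn(4, 128))
```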
[Beta] TorchInductor CPU backend optimization
This feature advances Inductor's CPU backend optimization, including CPP backend code generation and FX fusions with customized CPU kernels. The Inductor CPU backend supports vectorization of common data types and all Inductor IR operations, along with static and symbolic shapes. It is compatible with both Linux and Windows and supports the default Python wrapper, the CPP wrapper, and AOT-Inductor mode.
Additionally, it extends the max-autotune mode of the GEMM template (prototyped in 2.5), offering further performance gains. The backend supports various FX fusions, lowering to customized kernels such as oneDNN for Linear/Conv operations and SDPA. The Inductor CPU backend consistently achieves performance speedups across three benchmark suites (TorchBench, Hugging Face, and TIMM), outperforming eager mode in 97.5% of the 193 models tested.
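For illustration, here is a minimal sketch of exercising the CPU backend with max-autotune; the tiny nn.Sequential model, shapes, and the FP16 cast are placeholders for this example:

```python
import torch
import torch.nn as nn

# Placeholder model; any nn.Module works with the Inductor CPU backend.
model = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 64)).eval()

# FP16 is supported on the CPU path in both eager mode and the CPP backend.
model = model.half()

# Optionally generate C++ wrapper code instead of the default Python wrapper:
# torch._inductor.config.cpp_wrapper = True

# max-autotune enables GEMM template autotuning on CPU.
compiled = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    out = compiled(torch.randn(32, 256, dtype=torch.float16))
```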
PROTOTYPE FEATURES
[Prototype] FlexAttention
We've introduced a flexible API that enables implementing various attention mechanisms such as Sliding Window, Causal Mask, and PrefixLM with just a few lines of idiomatic PyTorch code. This API leverages torch.compile to generate a fused FlashAttention kernel, which eliminates extra memory allocation and achieves performance comparable to handwritten implementations. Additionally, we automatically generate the backwards pass using PyTorch's autograd machinery. Furthermore, our API can take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.
For more information and examples, please refer to the official blog post and Attention Gym.
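As a small example of the API (shapes, dtypes, and the causal mask below are illustrative), a causal mask can be written in a few lines of PyTorch and handed to flex_attention through a block mask, which also lets the generated kernel exploit sparsity:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

# Causal masking: a query position may only attend to keys at or before it.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H, S, D = 2, 8, 1024, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# The block mask lets the kernel skip fully-masked blocks (attention sparsity).
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)

# torch.compile fuses this into a single FlashAttention-style kernel.
flex_attention = torch.compile(flex_attention)
out = flex_attention(q, k, v, block_mask=block_mask)
```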
... (truncated)
32f585d [Release only] use triton 3.1.x from pypi (#137895)
417a076 [split build] move periodic split builds into own concurrency group (#135510)...
119e734 [RELEASE-ONLY CHANGES] Fix dependency on filesystem on Linux (#137242)
783a6a4 [MPS] Add regression test for fft.fftfreq (#137215)
5375201 [MPS] Add missing dispatch to rshift.Tensor (#137212)
1de132e [MPS] Fix 5D+ reductions over negative dimentions (#137211)
0b1b609 [NCCL] Don't override waitUntilInitialized's setting of `comm->initialized_`...
0b45af9 Fix addmm silent correctness on aarch64 (#137208)
1a0b166 [ONNX] Add assertion nodes to ignoring list (#137214)
3a541ef Clarify that libtorch API is C++17 compatible (#137206)