cuDNN Frontend v1.23.0 is the recommended version for [cuDNN 9.21.0](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html#cudnn-9-21-0) and later releases. cudnn-frontend now ships pip wheels for Python 3.14t.

## New APIs 🚀 🚀

### Causal Conv1d

- Depthwise causal 1-D convolution with an optional fused SiLU activation (requires cuDNN 9.22.0): `y = activation(conv1d_causal(x, w) + b)`. Supports forward and backward passes with `torch.autograd` and `torch.compile`. Not yet supported on Windows. (A pure-PyTorch reference appears after this section.)

### Updates to Graph API

#### Transpose (requires cuDNN 9.22.0)

- Added a new `Graph::transpose` operation with `Transpose_attributes(permutation, optional compute dtype, name)`.

#### Slice (requires cuDNN 9.22.0)

- Extended `Slice_attributes` with `set_strides` for per-axis slice steps; strided slices update the inferred output shape and strides accordingly.
- Python: `pygraph.slice` now honors each dimension's `slice.step` (see the Python sketch after this section).

#### Concatenate (requires cuDNN 9.22.0)

- Extended `Concatenate_attributes` with an optional `set_in_place_index`. When unset, concatenate runs out-of-place per backend rules.

#### Reshape (requires cuDNN 9.22.0)

- Introduced `ReshapeMode_t` (`VIEW_ONLY`, `LOGICAL`) and `Reshape_attributes::set_reshape_mode` so reshapes can select a view-style or a lexicographic logical reshape.

#### Compile-time constants (requires cuDNN 9.22.0)

- Added `cudnn.scalar_type` (`RUNTIME_PARAM`, `COMPILE_TIME_CONST`) and `Graph::tensor(scalar, ScalarType)` overloads, so scalars can be either execution-time variant-pack inputs or constants embedded in the plan (see the sketch after this section).
- `Tensor_attributes` can be marked as a compile-time constant or as a normal runtime pass-by-value scalar.

## Open source kernels 🚀 🚀

* **[GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_srelu):** High-performance implementation of squared-ReLU fused with GEMM.
* **[GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_dsrelu):** High-performance implementation of dsquared-ReLU fused with GEMM.
* **[Grouped GEMM + GLU + Hadamard](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_glu_hadamard):** Dense grouped GEMM GLU forward fusion with a fused Hadamard transform and per-expert AMAX reduction.
* **[Grouped GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_srelu):** Contiguous grouped squared-ReLU GEMM for MoE workloads.
* **[Grouped GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_dsrelu):** Contiguous and discrete grouped dsquared-ReLU GEMM for MoE workloads.
* **[RMSNorm + RHT + amax](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/rmsnorm_rht_amax):** A fused CuTe DSL kernel for NVIDIA Blackwell GPUs (SM100+) that applies RMS normalization, a block-diagonal Hadamard transform with fixed block size `16`, and a per-CTA `amax` reduction.

## Fix block-scale quantize

The `scale` tensor uses a 128x4 reordered layout (`TensorReordering_t::F8_128x4`). When the reordering type is set on the scale tensor, the frontend automatically pads the inferred scale dimensions to align with the 128x4 block structure: non-batch, non-axis dimensions are padded to multiples of 128, and the quantize-axis dimension is padded to multiples of 4. (A small helper illustrating this rule appears below.)
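To make the padding rule concrete, here is a small standalone helper (hypothetical, not part of the library) that reproduces the arithmetic described above; treating axis 0 as the batch axis is an illustrative assumption:

```python
def pad_scale_dims(dims, quant_axis, batch_axes=(0,)):
    """Mirror the stated F8_128x4 rule: batch dims untouched, the quantize
    axis rounded up to a multiple of 4, all other dims to a multiple of 128."""
    round_up = lambda v, m: -(-v // m) * m
    return [
        d if i in batch_axes
        else round_up(d, 4) if i == quant_axis
        else round_up(d, 128)
        for i, d in enumerate(dims)
    ]

# An inferred scale of [2, 200, 3] with the quantize axis last pads to [2, 256, 4].
assert pad_scale_dims([2, 200, 3], quant_axis=2) == [2, 256, 4]
```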
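For the Causal Conv1d API above, a pure-PyTorch reference of `y = silu(conv1d_causal(x, w) + b)` is handy for checking numerics against the fused kernel. Shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def causal_conv1d_ref(x, w, b):
    """x: (B, C, T); w: (C, K), one filter per channel (depthwise); b: (C,)."""
    C, K = w.shape
    x = F.pad(x, (K - 1, 0))                           # left-pad only => causal
    y = F.conv1d(x, w.unsqueeze(1), bias=b, groups=C)  # depthwise conv
    return F.silu(y)                                   # the optional fused SiLU

x = torch.randn(2, 8, 32, requires_grad=True)  # (batch, channels, time)
w = torch.randn(8, 4, requires_grad=True)      # (channels, filter_width)
b = torch.zeros(8, requires_grad=True)
y = causal_conv1d_ref(x, w, b)   # (2, 8, 32); y[..., t] depends only on x[..., :t+1]
y.sum().backward()               # the reference is autograd-friendly as well
```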
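And a minimal Python-graph sketch of the new strided slice. The `pygraph.slice` entry point and per-dimension `slice.step` support come from the notes above; the exact argument form and the tensor shapes are assumptions:

```python
import cudnn

graph = cudnn.pygraph(
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

X = graph.tensor(name="X", dim=[8, 64, 128], stride=[64 * 128, 128, 1])

# Per-dimension steps are now honored: take every second row of the middle
# axis, so the inferred output shape becomes [8, 32, 128].
Y = graph.slice(X, [slice(0, 8), slice(0, 64, 2), slice(0, 128)])
Y.set_output(True)
```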
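The compile-time-constant distinction can be read as follows; the Python spelling here is inferred from the C++ names above and should be treated as pseudocode rather than the verified binding:

```python
# COMPILE_TIME_CONST: the scalar value is baked into the execution plan.
alpha_const = graph.tensor(0.5, cudnn.scalar_type.COMPILE_TIME_CONST)

# RUNTIME_PARAM: the scalar stays a variant-pack input supplied at execute time.
alpha_param = graph.tensor(0.5, cudnn.scalar_type.RUNTIME_PARAM)
```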
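Finally, one of the general improvements below changes a default: the grouped GEMM wrappers now compile with dynamic MNKL. A minimal sketch of opting back into the previous M-only behavior, assuming (as is typical for such flags) that the variable is read when the wrappers are first initialized:

```python
import os

# Must be set before the grouped GEMM wrappers are first used.
os.environ["CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL"] = "0"
```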
## General Improvements ✨✨

* Grouped GEMM APIs now default to dynamic MNKL compilation across the GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set `CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0` to restore the previous M-only dynamic behavior (see the snippet just before this section).
* Grouped GEMM wgrad wrapper APIs now support caller-provided output buffers (`wgrad_tensor` for dense, `wgrad_ptrs` for discrete).
* Removed the unused internal `c_tensor` from the Grouped GEMM quant path.

## Bug fixes 🐛

* Fixed a Grouped GEMM GLU bias compilation issue for 64B-aligned inputs with dynamic MNKL.
* Fixed a dropout issue on Blackwell when cuDNN frontend 1.21 is used with cuDNN backend 9.21 or 9.22.

## Benchmarking 📊

* Updated the benchmark results for the SDPA improvements. Added `Kimi-K2.6`, `LTX-2`, `Qwen 2.5`, and `Wan2.2` to the benchmark results page.

## Acknowledgements

* Thanks @haowen-han for fixing a bug in the block-scale matmul sample.