# cuDNN Frontend v1.23.0 Release Notes (#231)

Merged: Anerudhan merged 1 commit into `main` from `1.23.0-rc` on Apr 29, 2026.

cuDNN Frontend v1.23.0 is the recommended version for [cuDNN 9.21.0](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html#cudnn-9-21-0) and later releases.

cudnn-frontend now ships pip wheels for Python 3.14t.

## New APIs  🚀 🚀

### Causal Conv1d

- Depthwise causal 1-D convolution with optional fused SiLU activation (requires cuDNN 9.22.0): `y = activation(conv1d_causal(x, w) + b)`.
Supports forward and backward passes with `torch.autograd` and `torch.compile`. (Not yet supported on Windows.)
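The semantics can be sketched with a plain-Python reference (a minimal sketch, not the actual cuDNN kernel; the `x[channel][time]` layout and per-lag weight indexing are assumptions for illustration):

```python
import math

def silu(z: float) -> float:
    # SiLU / swish: z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def causal_conv1d_depthwise(x, w, b):
    # x: [channels][time], w: [channels][kernel_width], b: [channels]
    # Causal: y[c][t] depends only on x[c][t], x[c][t-1], ... (implicit left
    # zero-padding), and each channel uses its own filter (depthwise).
    y = []
    for c in range(len(x)):
        row = []
        for t in range(len(x[c])):
            acc = b[c]
            for lag in range(len(w[c])):
                if t - lag >= 0:
                    acc += w[c][lag] * x[c][t - lag]
            row.append(silu(acc))
        y.append(row)
    return y
```

With an identity filter (`w = [[1.0]]`, zero bias) this reduces to `y = silu(x)`, and no output element ever reads a future time step.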

### Updates to Graph API

#### Transpose (requires cuDNN 9.22.0)
- Added a new `Graph::transpose` configured via `Transpose_attributes` (permutation, optional compute dtype, name).
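As a reminder of the usual permutation convention (a hedged sketch; output dimension `i` taking input dimension `permutation[i]` is the common convention, but check the headers for the authoritative definition):

```python
def permute_dims(dims, permutation):
    # Output dimension i is input dimension permutation[i].
    return [dims[p] for p in permutation]
```

For example, permuting a `[N, H, W]` tensor with `permutation = [0, 2, 1]` yields shape `[N, W, H]`.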

#### Slice (requires cuDNN 9.22.0)
- Extended `Slice_attributes` with `set_strides` for per-axis slice steps; strided slices update the inferred output shape and strides accordingly.
- Python: `pygraph.slice` now honors each dimension's `slice.step`.
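The effect of a per-axis step on the inferred output extent mirrors Python slicing; a minimal sketch:

```python
def sliced_extent(start: int, stop: int, step: int) -> int:
    # Number of elements selected by [start:stop:step] with positive step:
    # ceil((stop - start) / step), clamped at zero.
    return max(0, -(-(stop - start) // step))
```

This matches `len(range(start, stop, step))`: e.g. `[0:10:3]` selects indices 0, 3, 6, 9, so the extent is 4.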

#### Concatenate (requires cuDNN 9.22.0)
- Extended `Concatenate_attributes` with an optional `set_in_place_index`. When unset, concatenate runs out-of-place per backend rules.
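Out-of-place concatenation infers the output shape in the usual way (a sketch of the shape rule only, not the backend implementation):

```python
def concat_shape(shapes, axis):
    # All inputs must agree on every dimension except `axis`,
    # which is summed across inputs.
    out = list(shapes[0])
    out[axis] = sum(s[axis] for s in shapes)
    return out
```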

#### Reshape (requires cuDNN 9.22.0)
- Introduced `ReshapeMode_t` (`VIEW_ONLY`, `LOGICAL`) and `Reshape_attributes::set_reshape_mode` so a reshape can select between a view-style reshape and a lexicographic (row-major) logical reshape.
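"Lexicographic" here refers to row-major element order: a `LOGICAL` reshape reads elements in that order and refills the new shape, whereas `VIEW_ONLY` only reinterprets dims and strides. A minimal sketch of row-major index math (illustration only):

```python
def unravel(flat_index, dims):
    # Row-major (lexicographic) coordinates of a flat index:
    # the last dimension varies fastest.
    coords = []
    for d in reversed(dims):
        coords.append(flat_index % d)
        flat_index //= d
    return list(reversed(coords))
```

Flat index 5 in a `[2, 3]` tensor is element `[1, 2]`; after a logical reshape to `[3, 2]` the same element sits at the same flat index, now coordinates `[2, 1]`.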

#### Compile-time constants (requires cuDNN 9.22.0)
- Added `cudnn.scalar_type` (`RUNTIME_PARAM`, `COMPILE_TIME_CONST`) and `Graph::tensor(scalar, ScalarType)` overloads, so scalars can be either execution-time variant-pack inputs or constants embedded in the plan.
- `Tensor_attributes` can be marked as a compile-time constant or as a normal runtime pass-by-value scalar.
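The distinction is analogous to baking a value into a compiled artifact versus passing it at call time; a plain-Python analogy (not the cudnn-frontend API):

```python
def build_scaled_add(scale=None):
    if scale is not None:
        # COMPILE_TIME_CONST analogue: the scalar is embedded in the
        # "plan" (closure); changing it requires rebuilding.
        return lambda x, y: x + scale * y
    # RUNTIME_PARAM analogue: the scalar travels with the inputs
    # (variant pack) at execution time.
    return lambda x, y, scale: x + scale * y
```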

## Open source kernels  🚀 🚀
*   **[GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_srelu):** High-performance implementation of squared-ReLU fused with GEMM.
*   **[GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_dsrelu):** High-performance implementation of dsquared-ReLU fused with GEMM.
*   **[Grouped GEMM + GLU + Hadamard](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_glu_hadamard):** Dense grouped GEMM GLU forward fusion with a fused Hadamard transform and per-expert AMAX reduction.
*   **[Grouped GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_srelu):** Contiguous grouped squared-ReLU GEMM for MoE workloads.
*   **[Grouped GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_dsrelu):** Contiguous and discrete grouped dsquared-ReLU GEMM for MoE workloads.
*   **[RMSNorm + RHT + amax](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/rmsnorm_rht_amax)**: A fused CUTE DSL kernel for NVIDIA Blackwell GPUs (SM100+) that applies RMS normalization, a block-diagonal Hadamard transform with fixed block size `16`, and a per-CTA `amax` reduction.
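For reference, the pointwise definitions (assuming sReLU denotes squared-ReLU and dsReLU its derivative; the kernels above fuse these into the GEMM epilogue):

```python
def srelu(x: float) -> float:
    # squared-ReLU: max(x, 0)^2
    r = max(x, 0.0)
    return r * r

def dsrelu(x: float) -> float:
    # derivative of squared-ReLU: 2 * max(x, 0)
    return 2.0 * max(x, 0.0)
```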

### Fix block-scale quantize

The `scale` tensor uses a 128x4 reordered layout (`TensorReordering_t::F8_128x4`). When the reordering type is set on the scale tensor, the frontend automatically pads the inferred scale dimensions to align with the 128x4 block structure: non-batch, non-axis dimensions are padded to multiples of 128, and the quantize-axis dimension is padded to multiples of 4.
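The padding rule can be sketched as follows (an illustration of the stated rule; the argument names are hypothetical):

```python
def pad_scale_dims(dims, quantize_axis, batch_axes=()):
    # Pad inferred scale dims to the 128x4 block structure:
    #   non-batch, non-axis dims -> multiples of 128
    #   quantize-axis dim        -> multiples of 4
    def round_up(n, m):
        return -(-n // m) * m
    out = []
    for i, d in enumerate(dims):
        if i in batch_axes:
            out.append(d)
        elif i == quantize_axis:
            out.append(round_up(d, 4))
        else:
            out.append(round_up(d, 128))
    return out
```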


## General Improvements  ✨✨

* Grouped GEMM APIs now default to dynamic MNKL compilation across GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set `CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0` to restore the previous M-only dynamic behavior.

* Grouped GEMM wgrad wrapper APIs now support caller-provided output buffers (`wgrad_tensor` for dense, `wgrad_ptrs` for discrete).

* Removed an unused internal `c_tensor` from the Grouped GEMM quant path.
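The dynamic-MNKL toggle is an environment variable; a sketch of how such a flag is typically consumed, mirroring the documented default (dynamic MNKL unless the variable is set to `0`):

```python
import os

def grouped_gemm_dynamic_mnkl(env=None):
    # Default: full dynamic MNKL compilation.
    # CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0 restores the previous
    # M-only dynamic behavior.
    env = os.environ if env is None else env
    return env.get("CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL", "1") != "0"
```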

## Bug fixes 🐛

* Fixed a Grouped GEMM GLU bias compilation issue for 64B-aligned inputs with dynamic MNKL.

* Fixed an issue with dropout on Blackwell when cuDNN frontend 1.21 is used with cuDNN backend 9.21 or 9.22.

## Benchmarking 📊

*   Updated the benchmark results for the SDPA improvements; added `Kimi-K2.6`, `LTX-2`, `Qwen 2.5`, and `Wan2.2` to the benchmark results page.

## Acknowledgements
* Thanks to @haowen-han for fixing a bug in the block-scale matmul sample.
Anerudhan merged commit fb682ce into `main` on Apr 29, 2026; 1 check passed.