# cuDNN Frontend v1.23.0 Release Notes (#231)

Merged: Anerudhan merged 1 commit into `main` from `1.23.0-rc` on Apr 29, 2026.

cuDNN Frontend v1.23.0 is the recommended version for [cuDNN 9.21.0](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html#cudnn-9-21-0) and later releases.

cudnn-frontend now ships pip wheels for Python 3.14t.

## New APIs  🚀 🚀

### Causal Conv1d

- Depthwise causal 1-D convolution with optional fused SiLU activation (requires cuDNN 9.22.0): `y = activation(conv1d_causal(x, w) + b)`.
Supports forward and backward passes with `torch.autograd` and `torch.compile`. (Not yet supported on Windows.)
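The semantics can be sketched with a plain-Python reference (a minimal sketch, not the actual cuDNN kernel; the `x[channel][time]` layout and per-lag weight indexing are assumptions for illustration):

```python
import math

def silu(z: float) -> float:
    # SiLU / swish: z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def causal_conv1d_depthwise(x, w, b):
    # x: [channels][time], w: [channels][kernel_width], b: [channels]
    # Causal: y[c][t] depends only on x[c][t], x[c][t-1], ... (implicit left
    # zero-padding), and each channel uses its own filter (depthwise).
    y = []
    for c in range(len(x)):
        row = []
        for t in range(len(x[c])):
            acc = b[c]
            for lag in range(len(w[c])):
                if t - lag >= 0:
                    acc += w[c][lag] * x[c][t - lag]
            row.append(silu(acc))
        y.append(row)
    return y
```

With an identity filter (`w = [[1.0]]`, zero bias) this reduces to `y = silu(x)`, and no output element ever reads a future time step.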

### Updates to Graph API

#### Transpose (requires cuDNN 9.22.0)
- Added a new `Graph::transpose` configured via `Transpose_attributes` (permutation, optional compute dtype, name).
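As a reminder of the usual permutation convention (a hedged sketch; output dimension `i` taking input dimension `permutation[i]` is the common convention, but check the headers for the authoritative definition):

```python
def permute_dims(dims, permutation):
    # Output dimension i is input dimension permutation[i].
    return [dims[p] for p in permutation]
```

For example, permuting a `[N, H, W]` tensor with `permutation = [0, 2, 1]` yields shape `[N, W, H]`.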

#### Slice (requires cuDNN 9.22.0)
- Extended `Slice_attributes` with `set_strides` for per-axis slice steps; strided slices update the inferred output shape and strides accordingly.
- Python: `pygraph.slice` now honors each dimension's `slice.step`.
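The effect of a per-axis step on the inferred output extent mirrors Python slicing; a minimal sketch:

```python
def sliced_extent(start: int, stop: int, step: int) -> int:
    # Number of elements selected by [start:stop:step] with positive step:
    # ceil((stop - start) / step), clamped at zero.
    return max(0, -(-(stop - start) // step))
```

This matches `len(range(start, stop, step))`: e.g. `[0:10:3]` selects indices 0, 3, 6, 9, so the extent is 4.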

#### Concatenate (requires cuDNN 9.22.0)
- Extended `Concatenate_attributes` with an optional `set_in_place_index`. When unset, concatenate runs out-of-place per backend rules.
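Out-of-place concatenation infers the output shape in the usual way (a sketch of the shape rule only, not the backend implementation):

```python
def concat_shape(shapes, axis):
    # All inputs must agree on every dimension except `axis`,
    # which is summed across inputs.
    out = list(shapes[0])
    out[axis] = sum(s[axis] for s in shapes)
    return out
```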

#### Reshape (requires cuDNN 9.22.0)
- Introduced `ReshapeMode_t` (`VIEW_ONLY`, `LOGICAL`) and `Reshape_attributes::set_reshape_mode` so a reshape can select between a view-style reshape and a lexicographic (row-major) logical reshape.
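"Lexicographic" here refers to row-major element order: a `LOGICAL` reshape reads elements in that order and refills the new shape, whereas `VIEW_ONLY` only reinterprets dims and strides. A minimal sketch of row-major index math (illustration only):

```python
def unravel(flat_index, dims):
    # Row-major (lexicographic) coordinates of a flat index:
    # the last dimension varies fastest.
    coords = []
    for d in reversed(dims):
        coords.append(flat_index % d)
        flat_index //= d
    return list(reversed(coords))
```

Flat index 5 in a `[2, 3]` tensor is element `[1, 2]`; after a logical reshape to `[3, 2]` the same element sits at the same flat index, now coordinates `[2, 1]`.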

#### Compile-time constants (requires cuDNN 9.22.0)
- Added `cudnn.scalar_type` (`RUNTIME_PARAM`, `COMPILE_TIME_CONST`) and `Graph::tensor(scalar, ScalarType)` overloads, so scalars can be either execution-time variant-pack inputs or constants embedded in the plan.
- `Tensor_attributes` can be marked as a compile-time constant or as a normal runtime pass-by-value scalar.
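The distinction is analogous to baking a value into a compiled artifact versus passing it at call time; a plain-Python analogy (not the cudnn-frontend API):

```python
def build_scaled_add(scale=None):
    if scale is not None:
        # COMPILE_TIME_CONST analogue: the scalar is embedded in the
        # "plan" (closure); changing it requires rebuilding.
        return lambda x, y: x + scale * y
    # RUNTIME_PARAM analogue: the scalar travels with the inputs
    # (variant pack) at execution time.
    return lambda x, y, scale: x + scale * y
```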

## Open source kernels  🚀 🚀
*   **[GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_srelu):** High-performance implementation of squared-ReLU fused with GEMM.
*   **[GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_dsrelu):** High-performance implementation of dsquared-ReLU fused with GEMM.
*   **[Grouped GEMM + GLU + Hadamard](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_glu_hadamard):** Dense grouped GEMM GLU forward fusion with a fused Hadamard transform and per-expert AMAX reduction.
*   **[Grouped GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_srelu):** Contiguous grouped squared-ReLU GEMM for MoE workloads.
*   **[Grouped GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_dsrelu):** Contiguous and discrete grouped dsquared-ReLU GEMM for MoE workloads.
*   **[RMSNorm + RHT + amax](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/rmsnorm_rht_amax)**: A fused CUTE DSL kernel for NVIDIA Blackwell GPUs (SM100+) that applies RMS normalization, a block-diagonal Hadamard transform with fixed block size `16`, and a per-CTA `amax` reduction.
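For reference, the pointwise definitions (assuming sReLU denotes squared-ReLU and dsReLU its derivative; the kernels above fuse these into the GEMM epilogue):

```python
def srelu(x: float) -> float:
    # squared-ReLU: max(x, 0)^2
    r = max(x, 0.0)
    return r * r

def dsrelu(x: float) -> float:
    # derivative of squared-ReLU: 2 * max(x, 0)
    return 2.0 * max(x, 0.0)
```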

### Fix block-scale quantize

The `scale` tensor uses a 128x4 reordered layout (`TensorReordering_t::F8_128x4`). When the reordering type is set on the scale tensor, the frontend automatically pads the inferred scale dimensions to align with the 128x4 block structure: non-batch, non-axis dimensions are padded to multiples of 128, and the quantize-axis dimension is padded to multiples of 4.
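The padding rule can be sketched as follows (an illustration of the stated rule; the argument names are hypothetical):

```python
def pad_scale_dims(dims, quantize_axis, batch_axes=()):
    # Pad inferred scale dims to the 128x4 block structure:
    #   non-batch, non-axis dims -> multiples of 128
    #   quantize-axis dim        -> multiples of 4
    def round_up(n, m):
        return -(-n // m) * m
    out = []
    for i, d in enumerate(dims):
        if i in batch_axes:
            out.append(d)
        elif i == quantize_axis:
            out.append(round_up(d, 4))
        else:
            out.append(round_up(d, 128))
    return out
```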


## General Improvements  ✨✨

* Grouped GEMM APIs now default to dynamic MNKL compilation across GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set `CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0` to restore the previous M-only dynamic behavior.

* Grouped GEMM wgrad wrapper APIs now support caller-provided output buffers (`wgrad_tensor` for dense, `wgrad_ptrs` for discrete).

* Removed an unused internal `c_tensor` from the Grouped GEMM quant path.
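The dynamic-MNKL toggle is an environment variable; a sketch of how such a flag is typically consumed, mirroring the documented default (dynamic MNKL unless the variable is set to `0`):

```python
import os

def grouped_gemm_dynamic_mnkl(env=None):
    # Default: full dynamic MNKL compilation.
    # CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0 restores the previous
    # M-only dynamic behavior.
    env = os.environ if env is None else env
    return env.get("CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL", "1") != "0"
```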

## Bug fixes 🐛

* Fixed a Grouped GEMM GLU bias compilation issue for 64B-aligned inputs with dynamic MNKL.

* Fixed an issue with dropout on Blackwell when cuDNN frontend 1.21 is used with cuDNN backend 9.21 or 9.22.

## Benchmarking 📊

*   Updated the benchmark results for the SDPA improvements; added `Kimi-K2.6`, `LTX-2`, `Qwen 2.5`, and `Wan2.2` to the benchmark results page.

## Acknowledgements
* Thanks to @haowen-han for fixing a bug in the block-scale matmul sample.
Anerudhan merged commit fb682ce into `main` on Apr 29, 2026; 1 check passed.