# CUTLASS 4.x

## [4.2.1](https://github.com/NVIDIA/cutlass/releases/tag/v4.2.1) (2025-09-22)

### CuTe DSL
* Bug fixes and improvements
  - Fixed an issue when running DSL code with cuda-python 13.0
  - Fixed an issue when running DSL code under TorchInductor
  - Fixed unexpected logging when running DSL code in FlashInfer
  - Fixed the issue reported in https://github.com/NVIDIA/cutlass/issues/2647
  - Fixed an issue with variables conditionally defined outside of dynamic control flow

### CUTLASS C++
* Bypass EVT for nosmem blockwise kernels on Blackwell.
* Rename the cutlass/python/cutlass directory to cutlass/python/cutlass_cppgen.

## [4.2.0](https://github.com/NVIDIA/cutlass/releases/tag/v4.2.0) (2025-09-15)

### CuTe DSL
* More Python versions are now supported for both x86-64 and aarch64:
  - Python 3.10, 3.11, 3.12, and 3.13
* Added a new example and updated a notebook to help get started with CuTe DSL
  - [Call kernels with dlpack bypassed](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/ampere/call_bypass_dlpack.py)
  - Updates to the [TensorSSA demonstration](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/tensorssa.ipynb)
    + Added a section introducing broadcast
* API updates
  - Please refer to the [DSL API changelog](https://docs.nvidia.com/cutlass/media/docs/pythonDSL/cute_dsl_api/changelog.html) for details
* Bug fixes and improvements
  - Fixed `cute.print_tensor` for coordinate tensors
  - Fixed `cute.print` for tuples of layouts
  - Fixed frozen objects not being properly updated after full assignment in dynamic control flow
  - Fixed a compilation failure when assigning a tuple/list element inside dynamic control flow
  - Improved the error message when the CUDA context is not initialized
  - Improved the docstrings of `congruent` and `weakly_congruent`
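For intuition, the congruence relations those docstrings cover can be sketched in plain Python. This is an illustrative model of the CuTe profile concept, not the DSL implementation:

```python
# Illustrative model of CuTe's profile congruence (not the DSL implementation):
# two nested tuples are congruent when their nesting structures match exactly;
# `a` is weakly congruent to `b` when each leaf of `a` may stand in for an
# entire subtree of `b`.
def congruent(a, b):
    if isinstance(a, tuple) and isinstance(b, tuple):
        return len(a) == len(b) and all(congruent(x, y) for x, y in zip(a, b))
    return not isinstance(a, tuple) and not isinstance(b, tuple)

def weakly_congruent(a, b):
    if not isinstance(a, tuple):
        return True   # a leaf matches any subtree of b
    if not isinstance(b, tuple):
        return False  # a tuple cannot match a leaf
    return len(a) == len(b) and all(weakly_congruent(x, y) for x, y in zip(a, b))

print(congruent((1, (2, 3)), (4, (5, 6))))         # True: identical profiles
print(weakly_congruent((1, 2), ((1, 1), (2, 2))))  # True: leaves refine to subtrees
print(congruent((1, 2), ((1, 1), (2, 2))))         # False: profiles differ
```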
### CUTLASS C++
* Support for Blackwell SM103 kernels for B300 GPUs.
  - Collective mainloop code: [blockscaled datatypes with support for the dense GEMM mainloop](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm103_blockscaled_mma_warpspecialized.hpp).
  - New [GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/dispatch_policy.hpp) and [epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/dispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.
  - Kernel code: [blockscaled datatypes with support for the dense GEMM kernel](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm103_blockscaled_gemm_tma_warpspecialized.hpp).
* A set of examples that demonstrates usage of the 3.x API for targeting the Blackwell SM103 architecture:
  - [Blockscaled ultra fp4 dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/89_sm103_fp4_ultra_gemm/).
  - [Blockscaled ultra fp4 dense grouped GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/90_sm103_fp4_ultra_grouped_gemm).
* A set of unit tests that demonstrates usage of Blackwell SM103 blockscaled GEMM:
  - Unit test files prefixed with `sm103_` under [GEMM device unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/).
* Support for Blackwell SM121 kernels for DGX Spark GPUs.
  - These share most code with the Blackwell SM120 kernels.
* Add support for heuristics-based kernel filtering and autotuning using `nvidia-matmul-heuristics` to find the best kernels for a given scenario.
  - For details, please refer to the [heuristics doc](https://github.com/NVIDIA/cutlass/tree/main/media/docs/cpp/heuristics.md).
* Further enhance Blackwell SM100 attention kernels in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).
  - Add fused reduction kernel support for CUTLASS MLA.
  - Add softmax skip correction.
  - Support GQA in the FMHA backward kernel.
  - Fix an issue where `get_unmasked_trip_count` may return a negative value.
  - Fix an issue where mbarriers are initialized with a zero arrival count.
  - Fix a corner-case issue where the sequence length of Q is not a multiple of `tile_q`.
  - Remove TMA padding for forward kernel inputs.
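The softmax skip correction above follows the standard online-softmax pattern used per KV tile: the running accumulator only needs rescaling when the running row max actually changes, so the correction multiply can be skipped otherwise. A hypothetical NumPy sketch (illustrative names and a scalar accumulator, not the kernel's actual code):

```python
import numpy as np

# Hypothetical sketch of "softmax skip correction" in online softmax
# (illustrative, not CUTLASS identifiers): rescale the running denominator
# and accumulator only when the running max changes across tiles.
def online_softmax_weighted_sum(score_tiles, value_tiles):
    m, d, acc = -np.inf, 0.0, 0.0   # running max, denominator, accumulator
    for s, v in zip(score_tiles, value_tiles):
        m_new = max(m, float(np.max(s)))
        if m_new != m:  # skip the correction multiply when the max is unchanged
            c = np.exp(m - m_new) if np.isfinite(m) else 0.0
            d, acc, m = d * c, acc * c, m_new
        p = np.exp(s - m)
        d += float(np.sum(p))
        acc += float(np.sum(p * v))
    return acc / d

s1, s2 = np.array([1.0, 2.0]), np.array([0.5, 3.0])   # per-tile scores
v1, v2 = np.array([1.0, 1.0]), np.array([2.0, 2.0])   # per-tile values
out = online_softmax_weighted_sum([s1, s2], [v1, v2])
```

The streamed result matches computing softmax over the concatenated scores in one pass, which is what makes the per-tile formulation safe.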
* Add Blackwell SM100 kernels for MoEs (focusing on low-latency inference performance): [example 92](https://github.com/NVIDIA/cutlass/tree/main/examples/92_blackwell_moe_gemm/). These kernels use TMA (for weights) and CPASYNC (for tokens) to load the input matrices, and allow only one problem dimension to vary across groups/experts, unlike general grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming; any feedback on the API is welcome.
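As a shape-level illustration of that constraint, here is a toy NumPy sketch (assumed sizes, not the CUTLASS API): only M, the per-expert token count, varies, while every expert shares N and K:

```python
import numpy as np

# Toy sketch (not the CUTLASS API): in these MoE GEMMs only one problem
# dimension varies across experts -- here M, the per-expert token count --
# while N and K are shared, unlike a fully general grouped GEMM.
N, K = 8, 16
tokens_per_expert = [3, 5, 2]                                  # M_i varies
rng = np.random.default_rng(0)
A = [rng.standard_normal((m, K)) for m in tokens_per_expert]   # token tiles
W = rng.standard_normal((len(tokens_per_expert), K, N))        # one weight per expert

C = [a @ W[i] for i, a in enumerate(A)]                        # one GEMM per expert
print([c.shape for c in C])                                    # [(3, 8), (5, 8), (2, 8)]
```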
* Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell:
  - On Blackwell SM120, a blockwise GEMM kernel is added: [example 87](https://github.com/NVIDIA/cutlass/tree/main/examples/87_blackwell_geforce_gemm_blockwise/).
  - On Hopper, add K-major scale factor support for SM90 blockwise kernels.
  - On Hopper, relax the restriction that the K dimension of the problem size must be a multiple of the K dimension of the tile size.
  - On Hopper, the grouped version now supports the case when K = 0.
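For intuition, blockwise-scaled GEMM can be sketched in NumPy (toy block size and random data; purely illustrative, not the kernel implementation): each block of the quantized operands carries one scale factor that is applied during dequantization before the matmul.

```python
import numpy as np

# Toy sketch of blockwise-scaled GEMM (illustrative, not CUTLASS code):
# each BLKxBLK block of A and B carries one scale factor.
BLK, M, N, K = 2, 4, 4, 4
rng = np.random.default_rng(0)
A_q = rng.integers(-4, 5, (M, K)).astype(np.float32)       # "quantized" A
B_q = rng.integers(-4, 5, (K, N)).astype(np.float32)       # "quantized" B
SFA = rng.random((M // BLK, K // BLK)).astype(np.float32)  # per-block scales for A
SFB = rng.random((K // BLK, N // BLK)).astype(np.float32)  # per-block scales for B

# Broadcast each per-block scale over its block, dequantize, then multiply.
expand = lambda sf: np.kron(sf, np.ones((BLK, BLK), np.float32))
C = (A_q * expand(SFA)) @ (B_q * expand(SFB))
```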
* Support for Blackwell SM100 fp4 GEMV kernels.
  - Kernel code: [GEMV kernel](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/gemv_blockscaled.h).
  - Example code: [example 91](https://github.com/NVIDIA/cutlass/tree/main/examples/91_fp4_gemv/).
* Support for Blackwell SM100 legacy mixed-input GEMM kernels.
  - Collective mainloop code: [mixed-input mainloop](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp).
  - Kernel code: [mixed-input kernel](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized_mixed_input_transform.hpp).
  - Example code: [example 86](https://github.com/NVIDIA/cutlass/tree/main/examples/86_blackwell_mixed_dtype_gemm/).
* Support for the Blackwell SM100 cpasync kernel.
  - Collective mainloop code: [cpasync mainloop](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm100_mma_cpasync_warpspecialized.hpp).
  - Kernel code: [cpasync kernel](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm100_gemm_cpasync_warpspecialized.hpp).
* Support Blackwell SM120 mixed-input blockscaled grouped GEMM.
* Instantiate more Blackwell kernels in the profiler.
  - Blackwell SM100 and SM103 kernels support `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate all possible combinations.
  - To use this feature, `CUTLASS_LIBRARY_KERNELS` must be non-empty; the profiler combines `CUTLASS_LIBRARY_KERNELS` and `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate specific kernels.
  - For details, please check the [profiler doc](https://github.com/NVIDIA/cutlass/tree/main/media/docs/cpp/profiler.md).
* Fix some profiler issues:
  - Modify default cluster fallback values to be non-zero to avoid profiler failures when these values are not set on the command line.
  - Fix some no-output and timeout issues.
  - Fix pingpong blockwise Hopper library generation.
* From CUDA 13.0, Blackwell SM101 for Thor GPUs is renamed to SM110.
  - For CUDA toolkit versions < 13.0, SM101 is still used for Thor GPUs.
  - For CUDA toolkit versions >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
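The rename amounts to a simple toolkit-version switch; a hypothetical helper (names are ours, not a CUTLASS API) capturing it:

```python
# Hypothetical helper (not a CUTLASS API): resolve Thor's compute-capability
# name from the CUDA toolkit major version, per the rename above.
def thor_arch(cuda_toolkit_major: int) -> str:
    return "sm_110" if cuda_toolkit_major >= 13 else "sm_101"

print(thor_arch(12))  # sm_101
print(thor_arch(13))  # sm_110
```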
* Rename the legacy Python API package from `cutlass` to `cutlass_cppgen` and add Blackwell EVT support to the legacy Python interface.
  - Restructured the C++ Blackwell SM100 collective epilogue builder to work with the Python interface's `EpilogueDescriptors`.
  - Added a Blackwell SM100 EVT emitter on the Python side and routed most emission through the Hopper SM90 emitter.
  - Added some support for running SM100 kernels via the Python interface.
* CuTe changes:
  - Fix inaccurate GridDim calculation in the [CuTe tutorial](https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial/blackwell/).
  - Add [movmatrix](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-movmatrix) support.
  - Shorten the `nullspace` implementation.
  - Isolate and comment on `cosize` hacks.
  - Important documentation correction: `E<0,1> == 1@0@1`.
* Fix some kernel issues:
  - Fix the Hopper SM90 grouped GEMM kernel to only use commit group and wait group instead of also waiting on mbarriers.
  - Fix a bug in the Blackwell SM103 fp4 grouped GEMM kernel when K is large.
* Add the following unit tests:
  - [fp16 accumulator for SM89 fp8 MMA](https://github.com/NVIDIA/cutlass/tree/main/test/unit/cute/ampere/cooperative_gemm.cu)
  - [movmatrix test](https://github.com/NVIDIA/cutlass/tree/main/test/unit/cute/turing/movm.cu)
  - [fp16 narrow MMA-N](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/f16_f16_void_f32_narrow_mma_n.cu) and [fp8 narrow MMA-N](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/f8_f8_void_bf16_narrow_mma_n.cu)
* Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
|
52 |
| -* Optimal code generation with CUDA toolkit versions 13.0. |
| 108 | +* Optimal code generation with CUDA toolkit versions 13.0U1. |
53 | 109 |
## [4.1.0](https://github.com/NVIDIA/cutlass/releases/tag/v4.1.0) (2025-07-16)