
Conversation

@solrex (Contributor) commented on May 24, 2025

Inspired by #1932 and #2037, this PR implements blockwise scaling GEMM kernels on platforms before SM90:

  • FP8 blockwise/groupwise scaling kernel for Ada (L20, L40S, 4090); requires the accumulator type to be float
  • INT8 blockwise/groupwise scaling kernel for Ampere (A100/A800, A10, A30); requires the accumulator type to be int
  • CUTLASS 3.x API
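
For reference, the scale granularities in the examples can be read as: one scale per ScaleGranularityM rows of A, one per ScaleGranularityN columns of B, and one scale per K block, applied to the block accumulator before it is added to the running sum. Below is a minimal host-side sketch of that semantics in plain float; layouts and the function name are illustrative only, not the PR's API.

#include <vector>

// Reference blockwise/groupwise-scaled GEMM:
//   D[m][n] = sum over K blocks of SFA[m-group][kb] * SFB[n-group][kb] * (A block * B block)
// A is MxK row-major, B is KxN row-major, D is MxN row-major.
// SFA is (M/GranM) x (K/GranK), SFB is (N/GranN) x (K/GranK), both row-major.
// Assumes M, N, K are divisible by their granularities; real inputs are FP8/INT8
// with float/int32 block accumulators, widened here to float for clarity.
void ref_blockwise_gemm(int M, int N, int K, int GranM, int GranN, int GranK,
                        const std::vector<float>& A, const std::vector<float>& B,
                        const std::vector<float>& SFA, const std::vector<float>& SFB,
                        std::vector<float>& D) {
  int KBlocks = K / GranK;
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int kb = 0; kb < KBlocks; ++kb) {
        float block_acc = 0.f;                        // per-K-block accumulator
        for (int k = kb * GranK; k < (kb + 1) * GranK; ++k) {
          block_acc += A[m * K + k] * B[k * N + n];
        }
        acc += SFA[(m / GranM) * KBlocks + kb]        // one scale per GranM rows of A
             * SFB[(n / GranN) * KBlocks + kb]        // one scale per GranN cols of B
             * block_acc;
      }
      D[m * N + n] = acc;
    }
  }
}

With groupwise scaling (example 85a), GranM = 1 and GranN = 128; with blockwise scaling (85b), GranM = GranN = 128; GranK = 128 in both cases.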

@solrex solrex changed the title Add SM80/89 blockwise scaling kernel, support FP8 block/groupwise on Ada, INT8 block/groupwise on Ampere Add SM80/89 blockwise scaling kernel, support FP8 block/groupwise on Ada, INT8 on Ampere May 24, 2025
@solrex solrex force-pushed the sm80-blockscale branch from 2b2a88b to 5c58e77 on May 26, 2025 18:03
@hwu36 (Collaborator) commented on May 28, 2025

@jackkosaian

@solrex (Contributor, Author) commented on May 28, 2025

The following are example benchmark results on an L40S with CUDA 12.4 and CUTLASS main:

FP8:

$ ./examples/85_ada_ampere_gemm_with_blockwise_scaling/85a_ada_fp8_gemm_with_groupwise_scaling_cute
Problem Size: 1024x1024x1024x1
  Tile shape (M, N, K): _64, _128, _128
  ScaleGranularityM: 1 (ScaleMsPerTile: 64)
  ScaleGranularityN: 128 (ScaleNsPerTile: 1)
  Running... 
  Result MSE: 2.79446e-06, MRE: 12.0697, greatest error: 0.0196838
  Disposition: Passed
  Avg runtime: 0.00905421 ms
  GFLOPS: 237181

$ ./examples/85_ada_ampere_gemm_with_blockwise_scaling/85b_ada_fp8_gemm_with_blockwise_scaling_cute
  Problem Size: 1024x1024x1024x1
  Tile shape (M, N, K): _128, _128, _128
  ScaleGranularityM: 128 (ScaleMsPerTile: 1)
  ScaleGranularityN: 128 (ScaleNsPerTile: 1)
  Running... 
  Result MSE: 2.61817e-06, MRE: 11.7382, greatest error: 0.0210075
  Disposition: Passed
  Avg runtime: 0.0233175 ms
  GFLOPS: 92097.5

INT8: 

$ ./examples/85_ada_ampere_gemm_with_blockwise_scaling/85c_ampere_int8_gemm_with_groupwise_scaling_cute
  Problem Size: 1024x1024x1024x1
  Tile shape (M, N, K): _64, _128, _128
  ScaleGranularityM: 1 (ScaleMsPerTile: 64)
  ScaleGranularityN: 128 (ScaleNsPerTile: 1)
  Running... 
  Result MSE: 0, MRE: 81.7363, greatest error: 0
  Disposition: Passed
  Avg runtime: 0.00911155 ms
  GFLOPS: 235688

$ ./examples/85_ada_ampere_gemm_with_blockwise_scaling/85d_ampere_int8_gemm_with_blockwise_scaling_cute
  Problem Size: 1024x1024x1024x1
  Tile shape (M, N, K): _128, _128, _128
  ScaleGranularityM: 128 (ScaleMsPerTile: 1)
  ScaleGranularityN: 128 (ScaleNsPerTile: 1)
  Running... 
  Result MSE: 0, MRE: 77.9124, greatest error: 0
  Disposition: Passed
  Avg runtime: 0.0239155 ms
  GFLOPS: 89794.6
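
As a sanity check, the GFLOPS figures in these logs follow from 2·M·N·K multiply-add operations over the average runtime. A small stand-alone check (not part of the PR) using the 85a numbers above:

#include <cstdio>

int main() {
  double M = 1024, N = 1024, K = 1024;
  double ops = 2.0 * M * N * K;            // one multiply-accumulate counted as 2 ops
  double runtime_ms = 0.00905421;          // avg runtime reported by 85a above
  double gflops = ops / (runtime_ms * 1e-3) / 1e9;
  std::printf("GFLOPS: %.0f\n", gflops);   // prints ~237181, matching the log
  return 0;
}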

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@sunjianxide

Thank you very much for your work. Recently, I saw a CUTLASS PR #2378 that uses FP16 accumulation on SM89. Do you think this can be combined with the kernel from your current PR? I experimented with it, but encountered many issues. Would you have time to work on this? Thank you very much.

@solrex solrex force-pushed the sm80-blockscale branch from 3521a01 to ab3b26e on July 10, 2025 07:27
@solrex (Contributor, Author) commented on Jul 10, 2025

Thank you very much for your work. Recently, I saw a CUTLASS PR #2378 that uses FP16 accumulation on SM89. Do you think this can be combined with the kernel from your current PR? I experimented with it, but encountered many issues. Would you have time to work on this? Thank you very much.

I can try to add an FP16 accumulator option to the examples after that PR is merged.

@solrex (Contributor, Author) commented on Jul 31, 2025

Thank you very much for your work. Recently, I saw a CUTLASS PR #2378 that uses FP16 accumulation on SM89. Do you think this can be combined with the kernel from your current PR? I experimented with it, but encountered many issues. Would you have time to work on this? Thank you very much.

@sunjianxide I've attempted it; it compiles successfully but fails at runtime, and I haven't been able to diagnose the cause yet. You can check the patch below:

diff --git a/examples/85_ada_ampere_gemm_with_blockwise_scaling/85a_ada_fp8_gemm_with_groupwise_scaling_cute.cu b/examples/85_ada_ampere_gemm_with_blockwise_scaling/85a_ada_fp8_gemm_with_groupwise_scaling_cute.cu
index f9078d0c..089eddac 100644
--- a/examples/85_ada_ampere_gemm_with_blockwise_scaling/85a_ada_fp8_gemm_with_groupwise_scaling_cute.cu
+++ b/examples/85_ada_ampere_gemm_with_blockwise_scaling/85a_ada_fp8_gemm_with_groupwise_scaling_cute.cu
@@ -73,7 +73,7 @@
 
 using namespace cute;
 
-template <typename ArchTag, typename Element, int BLK_M, int BLK_N, int BLK_K, int PipelineStages = 3, int WARP_M = 2, int WARP_N = 2>
+template <typename ArchTag, typename Element, typename Accumulator, int BLK_M, int BLK_N, int BLK_K, int PipelineStages = 3, int WARP_M = 2, int WARP_N = 2>
 struct SM8x_Byte_Gemm_Traits {
   static constexpr int MMA_WARP_M = WARP_M * 16;
   static constexpr int MMA_WARP_N = WARP_N * 16;
@@ -90,7 +90,10 @@ struct SM8x_Byte_Gemm_Traits {
   );
 
   using MmaAtom = cute::conditional_t<cute::is_same_v<Element, cutlass::float_e4m3_t>,
-    MMA_Atom<SM89_16x8x32_F32E4M3E4M3F32_TN>,
+    cute::conditional_t<cute::is_same_v<Accumulator, cutlass::half_t>,
+      MMA_Atom<SM89_16x8x32_F16E4M3E4M3F16_TN>,
+      MMA_Atom<SM89_16x8x32_F32E4M3E4M3F32_TN>
+    >,
     MMA_Atom<SM80_16x8x32_S32S8S8S32_TN>>;
 
   using TileShape = Shape<Int<BLK_M>, Int<BLK_N>, Int<BLK_K>>;              // Threadblock-level tile size
@@ -154,7 +157,7 @@ constexpr int PipelineStages = 4;
 constexpr int BLK_M = 64;
 constexpr int BLK_N = 128;
 constexpr int BLK_K = 128;
-using GemmTrait = SM8x_Byte_Gemm_Traits<ArchTag, ElementA, BLK_M, BLK_N, BLK_K, PipelineStages>;
+using GemmTrait = SM8x_Byte_Gemm_Traits<ArchTag, ElementA, cutlass::half_t, BLK_M, BLK_N, BLK_K, PipelineStages>;
 using TileShape = GemmTrait::TileShape;
 using DispatchPolicy = cutlass::gemm::MainloopSm80CpAsyncBlockScaling<PipelineStages, GemmTrait::ClusterShape>;
 
diff --git a/include/cutlass/gemm/collective/sm80_mma_multistage_blockwise_scaling.hpp b/include/cutlass/gemm/collective/sm80_mma_multistage_blockwise_scaling.hpp
index 8461657b..37c19693 100644
--- a/include/cutlass/gemm/collective/sm80_mma_multistage_blockwise_scaling.hpp
+++ b/include/cutlass/gemm/collective/sm80_mma_multistage_blockwise_scaling.hpp
@@ -154,8 +154,10 @@ struct CollectiveMma<
 
   // Block scaling gmem-to-smem copy atom
   //  we can have partial tiles in M or N, so don't vectorize those loads
-  using CopyAtomSFA = Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<ElementBlockScale>, ElementBlockScale>;
-  using CopyAtomSFB = Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<ElementBlockScale>, ElementBlockScale>;
+  using CopyAtomSFA = cute::conditional_t<sizeof(ElementBlockScale) < 4,
+    Copy_Atom<UniversalCopy<ElementBlockScale>, ElementBlockScale>,
+    Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<ElementBlockScale>, ElementBlockScale>>;
+  using CopyAtomSFB = CopyAtomSFA;
   using GmemTiledCopySFA = decltype(make_tiled_copy(
     CopyAtomSFA{},
     Layout<Shape<Int<32>>>{},
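
One note on the second hunk: cp.async on SM80/89 only supports 4-, 8-, and 16-byte transactions, so a per-element copy atom over a 2-byte scale type such as cutlass::half_t cannot lower to SM80_CP_ASYNC_CACHEALWAYS; the sizeof(ElementBlockScale) < 4 fallback to UniversalCopy works around that. A minimal compile-time illustration of the constraint (hypothetical helper, not CUTLASS code):

#include <cstddef>

// Mirrors the size check in the hunk above: cp.async copies must move 4, 8,
// or 16 bytes per element, so 2-byte block scales need a plain (non-async) copy.
template <class ElementBlockScale>
constexpr bool can_lower_to_cp_async() {
  return sizeof(ElementBlockScale) == 4 || sizeof(ElementBlockScale) == 8 ||
         sizeof(ElementBlockScale) == 16;
}

static_assert(can_lower_to_cp_async<float>(), "4-byte scales can use cp.async");
static_assert(!can_lower_to_cp_async<short>(), "2-byte scales fall back to UniversalCopy");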
