Hi! First, I just wanted to say: incredible blog post! There's a surprising lack of great docs on complex kernels on Hopper, so it's been super helpful for us at Luminal as we build our compiler.
Second, I understand the logic behind the more complex schedule for maximising L2 cache hits, but when I switch schedules to see the expected slowdown, I don't see it on an H100 SXM (Build cuda_12.8.r12.8/compiler.35583870_0). On kernel 6 I see 684.3 TFLOPS with schedule 1 and 667.6 TFLOPS with schedule 0. That's a slowdown of only about 2.4%, nowhere near what I expected or what you saw when writing your post. Any thoughts on why this could be?