Not seeing performance divergence between Schedule<0...> and Schedule<1...> #12

Hi! Just wanted to say first, incredible blog post! There's surprisingly a lack of great docs on complex kernels on Hopper, so it's been super helpful for us at Luminal as we build our compiler.

Second, I understand the logic behind the more complex schedule for maximising L2 cache hits, but for some reason when I switch schedules to see the expected slowdown, I don't see it on an H100 SXM (`Build cuda_12.8.r12.8/compiler.35583870_0`). On kernel 6 I see `684.3` TFLOPS with schedule 1 and `667.6` TFLOPS with schedule 0. So there is some slowdown, but nowhere near what I expected or what you saw when writing your post. Any thoughts on why this could be?
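
To put a number on the gap, here is a quick sketch of the relative slowdown computed from the two throughput figures above. The variable names are just for illustration, and measuring the slowdown relative to schedule 1 is my own choice of baseline:

```python
# Throughput measured on kernel 6 (values from the runs reported above).
tflops_schedule_1 = 684.3  # the more complex, L2-friendly schedule
tflops_schedule_0 = 667.6  # the simpler schedule

# Relative slowdown of schedule 0, using schedule 1 as the baseline
# (the baseline choice is an assumption, not something from the post).
slowdown_pct = (tflops_schedule_1 - tflops_schedule_0) / tflops_schedule_1 * 100

print(f"schedule 0 is {slowdown_pct:.2f}% slower than schedule 1")
```

So the difference is only a few percent on this setup.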
