Not seeing performance divergence between Schedule<0...> and Schedule<1...> #12

@jafioti

Hi! First, I just wanted to say: incredible blog post! There's a surprising lack of great docs on complex kernels on Hopper, so it's been super helpful for us at Luminal as we build our compiler.

Second, I understand the logic behind the more complex schedule for maximising L2 cache hits, but when I switch schedules to see the expected slowdown, I don't see it on an H100 SXM (Build cuda_12.8.r12.8/compiler.35583870_0). On kernel 6 I see 684.3 TFLOPS with schedule 1 and 667.6 TFLOPS with schedule 0. So there is some slowdown (roughly 2.4%), but nowhere near what I expected or what you saw when writing your post. Any thoughts on why this could be?
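To make the numbers and the L2 argument concrete, here is a minimal sketch. It is not the actual `Schedule<0...>` / `Schedule<1...>` code from the repo. It is a toy model that compares a plain row-major tile order against a grouped order and counts how many distinct A row-strips and B column-strips one wave of concurrently running CTAs touches (fewer means more reuse from L2). The grid size, wave size, group height, and all function names are assumptions chosen for illustration.

```python
def relative_slowdown(fast_tflops, slow_tflops):
    """Fraction of throughput lost going from the fast to the slow schedule."""
    return (fast_tflops - slow_tflops) / fast_tflops


def row_major(pid, tiles_m, tiles_n):
    """Naive order: walk output tiles one row of tiles at a time."""
    return pid // tiles_n, pid % tiles_n


def grouped(pid, tiles_m, tiles_n, group_m):
    """Grouped order: cover `group_m` tile-rows at once, column by column."""
    pids_per_group = group_m * tiles_n
    first_m = (pid // pids_per_group) * group_m
    rows_in_group = min(tiles_m - first_m, group_m)  # last group may be short
    local = pid % pids_per_group
    return first_m + local % rows_in_group, local // rows_in_group


def wave_footprint(tile_of, wave_size):
    """Distinct A row-strips plus B column-strips touched by one wave."""
    tiles = [tile_of(pid) for pid in range(wave_size)]
    rows = {m for m, _ in tiles}
    cols = {n for _, n in tiles}
    return len(rows) + len(cols)


# Hypothetical problem shape: 32 x 32 output tiles, 128 CTAs in flight.
TILES_M = TILES_N = 32
WAVE = 128
GROUP_M = 8

slowdown = relative_slowdown(684.3, 667.6)  # numbers measured above
fp_row = wave_footprint(lambda p: row_major(p, TILES_M, TILES_N), WAVE)
fp_grp = wave_footprint(lambda p: grouped(p, TILES_M, TILES_N, GROUP_M), WAVE)

print(f"measured slowdown: {slowdown:.1%}")
print(f"strips per wave, row-major: {fp_row}, grouped: {fp_grp}")
```

In this toy setup the row-major wave touches 36 strips and the grouped wave touches 24, which is the kind of gap I expected to show up as a larger throughput difference than the one measured.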
