
Improve kernel 2 #8

Open
gordicaleksa wants to merge 1 commit into pranjalssh:main from gordicaleksa:pr_kernel2

Conversation

@gordicaleksa
Contributor

@gordicaleksa gordicaleksa commented Sep 16, 2025

A few modifications:

  1. Remove the __syncthreads after the mbarrier wait: it's superfluous (it doesn't really affect performance, but it's unnecessary bloat).
  2. Bump the alignment requirement from 128 B to 1024 B, as dictated by the 128B swizzle pattern.
  3. Add comments and remove the superfluous padding values from the tensor map; we're dealing with a 2D tensor, not a 5D one, and the extra values might confuse people! :)
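For illustration, a minimal sketch of what point 3 looks like in practice (function name, shapes, and box dimensions are assumptions, not taken from the PR): encoding a 2D tensor map only needs rank-2 arrays, with no 5D padding entries, and the 128B swizzle is what forces the 1024B base-address alignment from point 2.

```cuda
#include <cuda.h>
#include <cstdint>

// Illustrative only: build a TMA descriptor for a row-major float matrix.
// gmem_ptr must be 1024B-aligned when using CU_TENSOR_MAP_SWIZZLE_128B.
CUtensorMap make_2d_tensor_map(void* gmem_ptr, uint64_t rows, uint64_t cols) {
    CUtensorMap tensor_map{};
    // Rank-2 description: innermost dimension first, stride in bytes
    // (globalStrides has rank-1 entries; the innermost stride is implicit).
    uint64_t global_dim[2]    = {cols, rows};
    uint64_t global_stride[1] = {cols * sizeof(float)};
    // Shared-memory box per TMA transaction; with a 128B swizzle the inner
    // box extent times the element size must not exceed 128 bytes (32 floats).
    uint32_t box_dim[2]       = {32, 64};
    uint32_t elem_stride[2]   = {1, 1};
    cuTensorMapEncodeTiled(
        &tensor_map,
        CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
        /*tensorRank=*/2, gmem_ptr,
        global_dim, global_stride, box_dim, elem_stride,
        CU_TENSOR_MAP_INTERLEAVE_NONE,
        CU_TENSOR_MAP_SWIZZLE_128B,
        CU_TENSOR_MAP_L2_PROMOTION_NONE,
        CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return tensor_map;
}
```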

Also, importantly, I think the way the tensor map is currently passed is incorrect (and it's likely one of those bugs that shows up 0.01% of the time).

The docs describe three ways to pass a tensor map to the kernel; see this chapter. You've currently chosen to copy the tensor map to global memory (using cudaMemcpy). From the docs, verbatim:

"Finally, it is possible to copy the tensor map to global memory. Using a pointer to a tensor map in global device memory requires a fence in each thread block before any thread in the block uses the updated tensor map. Further uses of the tensor map by that thread block do not need to be fenced unless the tensor map is modified again. Note that this mechanism may be slower than the two mechanisms described above."

But the current code does not include this part:

// Fence acquire tensor map:
ptx::n32_t<128> size_bytes;
// Since the tensor map was modified from the host using cudaMemcpy,
// the scope should be .sys.
ptx::fence_proxy_tensormap_generic(
    ptx::sem_acquire, ptx::scope_sys, tensor_map, size_bytes);

I tested this approach and indeed it slows the kernel down by ~5 TFLOP/s. I also tested the grid_constant approach (which is the recommended one), and it's pretty much on par in terms of speed, but I didn't want to push that change until I see whether you're even accepting PRs. :P Also, I might be misinterpreting something here.
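For reference, a sketch of the grid_constant approach mentioned above (kernel name and usage are illustrative, not from this PR): the CUtensorMap is passed by value as a const __grid_constant__ kernel parameter, so the driver handles its placement and no per-block tensor-map fence is needed, unlike the cudaMemcpy-to-global-memory route quoted from the docs.

```cuda
#include <cuda.h>

// Illustrative kernel signature: the tensor map arrives as a
// __grid_constant__ parameter rather than via a global-memory pointer,
// so no ptx::fence_proxy_tensormap_generic is required in the kernel.
__global__ void kernel2(const __grid_constant__ CUtensorMap tensor_map) {
    // ... issue TMA bulk tensor copies using &tensor_map ...
}

// Host side: pass the host-built descriptor directly as a kernel argument,
// instead of cudaMemcpy-ing it to device global memory first:
//   kernel2<<<grid, block>>>(tensor_map);
```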

thanks Pranjal!

@gordicaleksa
Contributor Author

Ok, I see now that you've used grid_constant in the later kernels! :)

@pranjalssh
Owner

Yeah, this is a very initial prototype. Maybe we can skip this one; later we use up to 3D TMA. The 5D form makes it easy to copy/paste into anything in the future.

I am fine with removing the syncthreads, but this is mostly a basic prototype.

