
Improve kernel 2 #8

Open
gordicaleksa wants to merge 1 commit into pranjalssh:main from gordicaleksa:pr_kernel2

Conversation

@gordicaleksa
Contributor

@gordicaleksa gordicaleksa commented Sep 16, 2025

A few modifications:

  1. Remove the __syncthreads after the mbarrier wait: it's superfluous (it doesn't really affect performance, but it's unnecessary bloat).
  2. Bump the alignment requirement from 128 B to 1024 B, as dictated by the 128B swizzle pattern.
  3. Add comments and remove the superfluous padding values from the tensor map; we're dealing with a 2D tensor, not a 5D one, and the extra values might confuse people! :)
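For illustration, a minimal sketch of what point 3 looks like in practice (function name, shapes, and box dimensions are assumptions, not taken from the PR): encoding a 2D tensor map only needs rank-2 arrays, with no 5D padding entries, and the 128B swizzle is what forces the 1024B base-address alignment from point 2.

```cuda
#include <cuda.h>
#include <cstdint>

// Illustrative only: build a TMA descriptor for a row-major float matrix.
// gmem_ptr must be 1024B-aligned when using CU_TENSOR_MAP_SWIZZLE_128B.
CUtensorMap make_2d_tensor_map(void* gmem_ptr, uint64_t rows, uint64_t cols) {
    CUtensorMap tensor_map{};
    // Rank-2 description: innermost dimension first, stride in bytes
    // (globalStrides has rank-1 entries; the innermost stride is implicit).
    uint64_t global_dim[2]    = {cols, rows};
    uint64_t global_stride[1] = {cols * sizeof(float)};
    // Shared-memory box per TMA transaction; with a 128B swizzle the inner
    // box extent times the element size must not exceed 128 bytes (32 floats).
    uint32_t box_dim[2]       = {32, 64};
    uint32_t elem_stride[2]   = {1, 1};
    cuTensorMapEncodeTiled(
        &tensor_map,
        CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
        /*tensorRank=*/2, gmem_ptr,
        global_dim, global_stride, box_dim, elem_stride,
        CU_TENSOR_MAP_INTERLEAVE_NONE,
        CU_TENSOR_MAP_SWIZZLE_128B,
        CU_TENSOR_MAP_L2_PROMOTION_NONE,
        CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return tensor_map;
}
```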

Also, importantly, I think the way the tensor map is currently passed is incorrect (and it's likely one of those bugs that shows up 0.01% of the time).

The docs describe three ways to pass a tensor map to the kernel; see this chapter. You've currently chosen to copy the tensor map to global memory (using cudaMemcpy). From the docs, verbatim:

"Finally, it is possible to copy the tensor map to global memory. Using a pointer to a tensor map in global device memory requires a fence in each thread block before any thread in the block uses the updated tensor map. Further uses of the tensor map by that thread block do not need to be fenced unless the tensor map is modified again. Note that this mechanism may be slower than the two mechanisms described above."

But the current code does not include this part:

// Fence acquire tensor map:
ptx::n32_t<128> size_bytes;
// Since the tensor map was modified from the host using cudaMemcpy,
// the scope should be .sys.
ptx::fence_proxy_tensormap_generic(
    ptx::sem_acquire, ptx::scope_sys, tensor_map, size_bytes);

I tested this approach and indeed it slows the kernel down by ~5 TFLOP/s. I also tested the grid_constant approach (which is the recommended one), and it's pretty much on par in terms of speed, but I didn't want to push that change until I see whether you're even accepting PRs. :P Also, I might be misinterpreting something here.
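For reference, a sketch of the grid_constant approach mentioned above (kernel name and usage are illustrative, not from this PR): the CUtensorMap is passed by value as a const __grid_constant__ kernel parameter, so the driver handles its placement and no per-block tensor-map fence is needed, unlike the cudaMemcpy-to-global-memory route quoted from the docs.

```cuda
#include <cuda.h>

// Illustrative kernel signature: the tensor map arrives as a
// __grid_constant__ parameter rather than via a global-memory pointer,
// so no ptx::fence_proxy_tensormap_generic is required in the kernel.
__global__ void kernel2(const __grid_constant__ CUtensorMap tensor_map) {
    // ... issue TMA bulk tensor copies using &tensor_map ...
}

// Host side: pass the host-built descriptor directly as a kernel argument,
// instead of cudaMemcpy-ing it to device global memory first:
//   kernel2<<<grid, block>>>(tensor_map);
```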

thanks Pranjal!

@gordicaleksa
Contributor Author

Ok, I see now that you've used grid_constant in the later kernels! :)

@pranjalssh
Owner

Yeah, this is a very initial prototype. Maybe we can skip this one; later we use up to 3D TMA. The 5D form makes it easy to copy/paste into anything in the future.

I am fine with removing the syncthreads, but this is mostly a basic prototype.

