faster smem descriptor by gabriel-ambrosio · Pull Request #5 · pranjalssh/fast.cu

gabriel-ambrosio · 2025-05-31T00:07:36Z

The current function is:

__device__ uint64_t make_smem_desc(bf16* ptr) {
    // Convert shared memory pointer to integer
    uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
    uint64_t desc = matrix_descriptor_encode(addr);
    desc |= matrix_descriptor_encode((uint64_t)16) << 16;
    desc |= matrix_descriptor_encode((uint64_t)1024) << 32;
    desc |= 1llu << 62; // 128B swizzle
    return desc;
}

which does a lot of unnecessary computation that can be optimized.

The line "desc |= matrix_descriptor_encode((uint64_t)16) << 16;" can be removed entirely: for K-major swizzled layouts the PTX documentation describes this field as "not used, assumed to be 1.", so no value needs to be written there (the bits can remain zero).

The line "desc |= matrix_descriptor_encode((uint64_t)1024) << 32;" performs the same computation every time the function is called:
1024 & 0x3FFFF = 1024 -> 1024 >> 4 = 64 -> 64 << 32 = 0x0000004000000000, so we can replace it with the constant 0x0000004000000000.
When the "or" operation happens, it changes only that field of the descriptor.
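The arithmetic above can be checked at compile time. The sketch below assumes that matrix_descriptor_encode is defined as (x & 0x3FFFF) >> 4, which matches the encode formula given in the PTX documentation; the PR text itself does not show its definition, so treat that as an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Assumed definition (not shown in this PR): keep the low 18 bits,
// then drop the 4 least significant bits.
constexpr uint64_t matrix_descriptor_encode(uint64_t x) {
    return (x & 0x3FFFF) >> 4;
}

// 1024 & 0x3FFFF = 1024, and 1024 >> 4 = 64.
static_assert(matrix_descriptor_encode(1024) == 64, "encode(1024) should be 64");

// Shifted to bit 32, the field is the same constant on every call.
static_assert((matrix_descriptor_encode(1024) << 32) == 0x0000004000000000ull,
              "field at bit 32 should be 0x0000004000000000");
```

Because the input is a literal, the result never depends on the pointer passed to make_smem_desc.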

The line "desc |= 1llu << 62; // 128B swizzle" always evaluates to 0x4000000000000000, so the left shift is unnecessary.

Combining these last two changes gives the constant descriptor base 0x4000004000000000, so the only part that still has to be computed is the encoded addr.
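As a rough host-side check of the combined constant, the sketch below compares the original descriptor with the proposed one for a given shared-memory address. The names desc_original, desc_proposed and kDescBase are hypothetical, and matrix_descriptor_encode is again assumed to be (x & 0x3FFFF) >> 4. Under that assumption the two descriptors differ only in bit 16, the field the PTX documentation describes as unused for K-major swizzled layouts.

```cpp
#include <cassert>
#include <cstdint>

// Assumed definition (not shown in this PR).
constexpr uint64_t matrix_descriptor_encode(uint64_t x) {
    return (x & 0x3FFFF) >> 4;
}

// Descriptor as built by the original make_smem_desc (hypothetical host mirror).
constexpr uint64_t desc_original(uint32_t addr) {
    return matrix_descriptor_encode(addr)
         | (matrix_descriptor_encode(16) << 16)
         | (matrix_descriptor_encode(1024) << 32)
         | (1ull << 62);  // 128B swizzle
}

// Precomputed part: the bit-32 field plus the swizzle bit.
constexpr uint64_t kDescBase = 0x4000004000000000ull;

// Descriptor as proposed: constant base OR-ed with the encoded address.
constexpr uint64_t desc_proposed(uint32_t addr) {
    return kDescBase | matrix_descriptor_encode(addr);
}

static_assert(((matrix_descriptor_encode(1024) << 32) | (1ull << 62)) == kDescBase,
              "the two constant fields should combine into kDescBase");

// The only difference is the dropped field at bit 16.
static_assert((desc_original(0x4000) ^ desc_proposed(0x4000)) == (1ull << 16),
              "descriptors should differ only in bit 16");
```

The two forms are therefore not bit-identical; the change relies on the hardware ignoring that field, as the quoted documentation states.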

Consequently, this function no longer needs to exist: we can delete it and write a single bitwise "or" wherever an smem descriptor is created, as shown below:

...
uint64_t desc_a = 0x4000004000000000 | (matrix_descriptor_encode(static_cast<uint32_t>(__cvta_generic_to_shared(&sA[0]))));
uint64_t desc_b = 0x4000004000000000 | (matrix_descriptor_encode(static_cast<uint32_t>(__cvta_generic_to_shared(&sB[0]))));

...

This improves the performance of the kernel by about 1% :)

discarded unnecessary computations by writing the smem descriptor value that will be always the same for improved performance

71cd53e

gabriel-ambrosio wants to merge 1 commit into pranjalssh:main from gabriel-ambrosio:faster-smem-desc
