faster smem descriptor #5

Open
gabriel-ambrosio wants to merge 1 commit into pranjalssh:main from gabriel-ambrosio:faster-smem-desc

gabriel-ambrosio commented May 31, 2025

The current function is:

__device__ uint64_t make_smem_desc(bf16* ptr) {
    // Convert shared memory pointer to integer
    uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
    uint64_t desc = matrix_descriptor_encode(addr);
    desc |= matrix_descriptor_encode((uint64_t)16) << 16;
    desc |= matrix_descriptor_encode((uint64_t)1024) << 32;
    desc |= 1llu << 62; // 128B swizzle
    return desc;
}

This does a lot of unnecessary computation that can be optimized away.

The line "desc |= matrix_descriptor_encode((uint64_t)16) << 16;" can be removed entirely: for K-major swizzled layouts the PTX documentation describes this field as "not used, assumed to be 1.", so no value needs to be written there and the bits can stay zeroed.

The line "desc |= matrix_descriptor_encode((uint64_t)1024) << 32;" performs the same computation every time the function is called:
1024 & 0x3FFFF = 1024 -> 1024 >> 4 = 64 -> 64 << 32 = 0x0000004000000000
so we can replace it with the constant 0x0000004000000000. The "or" operation then sets only that field of the descriptor.
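The arithmetic above can be checked at compile time on the host. This is only a sketch: `matrix_descriptor_encode` is not shown in this PR, so the definition below (mask with 0x3FFFF, then shift right by 4) is an assumption inferred from the numbers in the description, and `kStrideField` is a hypothetical name.

```cpp
#include <cstdint>

// Assumption: stand-in for the helper that is not shown in this PR.
// It keeps the low 18 bits and drops the 4 least significant ones.
constexpr uint64_t matrix_descriptor_encode(uint64_t x) {
    return (x & 0x3FFFF) >> 4;
}

// Value the original line ORs into the descriptor on every call.
constexpr uint64_t kStrideField = matrix_descriptor_encode(1024) << 32;

static_assert(matrix_descriptor_encode(1024) == 64, "1024 >> 4 should be 64");
static_assert(kStrideField == 0x0000004000000000ull,
              "field should match the hard-coded constant");
```

If either assumption about the helper is wrong, the `static_assert`s fail at compile time instead of silently producing a different descriptor.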

The line "desc |= 1llu << 62; // 128B swizzle" always evaluates to 0x4000000000000000, so the left shift is unnecessary.

Combining the last two changes gives the constant descriptor base "0x4000004000000000", so the only part left to compute is the encoded addr.
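As a rough host-side check of the folding, the sketch below compares the original construction with the constant-based one for a sample address. The helper definition, the names `desc_original`, `desc_folded`, `kDescBase` and `kLeadingField`, and the address 0x1000 are all assumptions for illustration. The two descriptors differ only in the field at bit 16, which this PR drops on the basis of the PTX note quoted above.

```cpp
#include <cstdint>

// Assumption: stand-in for the helper that is not shown in this PR.
constexpr uint64_t matrix_descriptor_encode(uint64_t x) {
    return (x & 0x3FFFF) >> 4;
}

// Original construction, taking the shared-memory address as an integer.
constexpr uint64_t desc_original(uint32_t addr) {
    return matrix_descriptor_encode(addr)
         | (matrix_descriptor_encode(16) << 16)
         | (matrix_descriptor_encode(1024) << 32)
         | (1ull << 62);  // 128B swizzle
}

// Folded construction: constant base ORed with the encoded address.
constexpr uint64_t kDescBase = 0x4000004000000000ull;
constexpr uint64_t desc_folded(uint32_t addr) {
    return kDescBase | matrix_descriptor_encode(addr);
}

// The only field the folded version leaves zeroed.
constexpr uint64_t kLeadingField = matrix_descriptor_encode(16) << 16;

static_assert(((1ull << 62) | (matrix_descriptor_encode(1024) << 32)) == kDescBase,
              "swizzle bit and stride field fold into the base constant");
static_assert(desc_original(0x1000) == (desc_folded(0x1000) | kLeadingField),
              "descriptors differ only in the dropped field");
```

Whether leaving that field zeroed is safe depends on the hardware really ignoring it for this layout, so the result is worth validating on the actual kernel as well.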

Logically, this function does not need to exist at all: we can delete it and build each smem descriptor with a single bitwise "or", as shown below:

...
uint64_t desc_a = 0x4000004000000000 | (matrix_descriptor_encode(static_cast<uint32_t>(__cvta_generic_to_shared(&sA[0]))));
uint64_t desc_b = 0x4000004000000000 | (matrix_descriptor_encode(static_cast<uint32_t>(__cvta_generic_to_shared(&sB[0]))));

...

This improves the performance of the kernel by about 1% :)

…e that will be always the same for improved performance