@ashvardanian (Owner) commented Oct 29, 2025

All of our GPU algorithms lean heavily on shared memory and warp-level synchronization. That can be suboptimal for very small inputs, where we should keep everything in registers instead.

So I'm suggesting a new set of kernels that process the DP matrix row by row, but keep only one row and one scalar in GPU registers. Moreover, they process the matrix of uint8_t cells in slices of 4 contiguous entries, forming one uint32_t per slice in each row. That minimizes the number of loads and stores, as well as conversions between 8-bit and 32-bit representations, assuming GPUs don't have much custom logic for 8-bit types and upcast practically every time.

@ashvardanian ashvardanian changed the base branch from main to main-dev October 29, 2025 18:16