CUDA variable rate decompression hybrid index warp-level prefix sum #268
Hi!
I ran into a number of invalid memory accesses while testing the CUDA variable-rate decompression support in staging. I believe I've tracked them down to the index block offset calculation: it uses an offset array that is sized for a single warp but lives in shared memory, so every warp in the thread block writes to the same array. The other warps in the block clobber the offsets while they are being computed.
I've replaced it with a basic warp-level prefix sum, plus an assertion that the partition size equals the warp size. It would probably be better to redo this with something from the cooperative_groups namespace so that it supports partition sizes of any power of 2 up to the thread block size, but this should be enough to start getting useful results from decompression.
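For reviewers who haven't looked at the diff yet, here is a minimal sketch of the general technique, not the code in this PR. It assumes one partition per warp and a 1D thread block whose size is a multiple of 32. The names `warp_exclusive_scan`, `block_offsets_kernel`, and `scan_per_warp` are made up for illustration. Each lane keeps its running sum in a register and exchanges it through `__shfl_up_sync`, so no shared-memory array is involved and warps in the same block cannot overwrite each other's offsets.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned kWarpSize = 32;          // assumed partition size
constexpr unsigned kFullMask = 0xffffffffu; // all 32 lanes participate

// Exclusive prefix sum across the lanes of one warp (illustrative helper).
// Every lane of the warp must call this, since the full mask is used.
__device__ unsigned warp_exclusive_scan(unsigned value)
{
  const unsigned lane = threadIdx.x % kWarpSize;
  unsigned sum = value;
  // Log-step inclusive scan: after each step, sum covers 2*delta lanes.
  for (unsigned delta = 1; delta < kWarpSize; delta *= 2) {
    const unsigned up = __shfl_up_sync(kFullMask, sum, delta);
    if (lane >= delta)
      sum += up;
  }
  return sum - value; // convert inclusive result to exclusive
}

// Hypothetical kernel: lengths[i] is the size of block i, offsets[i]
// receives the offset of block i within its 32-block partition.
__global__ void block_offsets_kernel(const unsigned* lengths,
                                     unsigned* offsets, unsigned n)
{
  // The full-mask shuffle is only valid if every warp is complete.
  assert(blockDim.x % kWarpSize == 0);
  const unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
  // Out-of-range threads contribute 0 but still take part in the shuffle.
  const unsigned len = (i < n) ? lengths[i] : 0u;
  const unsigned off = warp_exclusive_scan(len);
  if (i < n)
    offsets[i] = off;
}

// Host-side wrapper for experimenting with the kernel (error checks omitted).
std::vector<unsigned> scan_per_warp(const std::vector<unsigned>& lengths)
{
  const unsigned n = static_cast<unsigned>(lengths.size());
  std::vector<unsigned> result(n);
  if (n == 0)
    return result;
  unsigned* d_in = nullptr;
  unsigned* d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(unsigned));
  cudaMalloc(&d_out, n * sizeof(unsigned));
  cudaMemcpy(d_in, lengths.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);
  const unsigned threads = 128; // four warps per block
  const unsigned blocks = (n + threads - 1) / threads;
  block_offsets_kernel<<<blocks, threads>>>(d_in, d_out, n);
  cudaMemcpy(result.data(), d_out, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return result;
}
```

Launching with four warps per block is deliberate: that is the configuration where a shared array sized for one warp would be overwritten, while the register-based scan keeps each warp's offsets independent.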