
Conversation

@jwake commented May 25, 2025

Hi!

I was running into a number of invalid memory accesses while testing the CUDA variable-rate decompression support in staging. I believe I've tracked it down to the index block offset calculation: it uses an offset array that is warp-sized but shared across the whole thread block, so the remaining warps in the block clobber each other's offset calculations.

I've replaced it with a basic warp-level prefix sum plus an assertion that the partition size equals the warp size. It would probably be better to redo this with something from the cooperative_groups namespace so it can support any power-of-two partition size up to the thread block size, but this should be sufficient to start getting useful results from decompression.
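For reference, here is a minimal sketch of a warp-level exclusive prefix sum using shuffle intrinsics, assuming a 32-lane warp with all lanes participating; the identifiers are illustrative, not the PR's actual code:

```cuda
// Minimal sketch of a warp-level exclusive prefix sum over per-thread block
// sizes, using shuffle intrinsics; assumes a 32-lane warp and that all lanes
// participate. Identifiers are illustrative, not the PR's actual code.
__device__ inline unsigned int warp_exclusive_prefix_sum(unsigned int value)
{
  const unsigned int lane = threadIdx.x & 31u;
  // Hillis-Steele inclusive scan within the warp.
  for (unsigned int offset = 1; offset < 32; offset <<= 1) {
    unsigned int neighbor = __shfl_up_sync(0xffffffffu, value, offset);
    if (lane >= offset)
      value += neighbor;
  }
  // Shift results down one lane to make the scan exclusive.
  unsigned int exclusive = __shfl_up_sync(0xffffffffu, value, 1);
  return (lane == 0) ? 0u : exclusive;
}
```

Because the scan stays in one warp's registers, other warps in the thread block cannot clobber its intermediate results, which is the failure mode described above.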

@lindstro (Member)

@jwake Thanks for your contribution. The parallel decoding implementation is currently wholly broken and will be rewritten over the coming months. We're re-engineering the block index representation and API, after which we will tackle decompression. Until that work has been finished and merged into develop, do not expect the staging branch to work correctly.

@jwake (Author) commented May 27, 2025

For what it's worth, I managed to get the parallel decoder working fairly well on CUDA (~70 GB/s on an A100 with buffers in device memory, though I'll note I'm only testing 3D at the moment). This took the change in this PR plus two manual workarounds: bodging the stream_rseek call at the end of zfp_internal_cuda_decompress so it resets the pointer offset without trying to read from the stream (in my case the stream was in device memory, so rseek's internal buffer read would fail; short of a separate device-aware bitstream implementation I'm not sure what the best approach would be from an API standpoint), and saving/restoring the appropriate index data externally after compression / before decompression. I also noted that further compressing the index data with e.g. LZ4 or Deflate netted an additional ~15% size reduction on the index, but with appropriate granularity the index was already pretty small compared to the bulk of the compressed data.
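To make the failure mode concrete, here is a hedged sketch of the host-side call pattern with the compressed buffer resident in device memory; the zfp calls are the public C API, while the helper name and error handling are made up:

```cuda
// Hedged sketch of the host-side call pattern described above, assuming the
// public zfp C API and a compressed buffer plus index already resident in
// device memory. Function name and error handling are illustrative.
#include "zfp.h"

size_t decompress_on_device(zfp_stream* zfp, zfp_field* field,
                            void* d_buffer, size_t buffer_bytes)
{
  // Wrap the device buffer in a host-side bitstream descriptor.
  bitstream* stream = stream_open(d_buffer, buffer_bytes);
  zfp_stream_set_bit_stream(zfp, stream);
  zfp_stream_rewind(zfp);

  // Request the CUDA back end.
  if (!zfp_stream_set_execution(zfp, zfp_exec_cuda))
    return 0;

  // zfp_internal_cuda_decompress ends with a host-side stream_rseek to
  // reposition the stream; with d_buffer in device memory that read faults,
  // hence the manual pointer-offset bookkeeping mentioned above.
  size_t bytes = zfp_decompress(zfp, field);

  stream_close(stream);
  return bytes;
}
```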

As an aside, adding a zfp_mode_expert clause to the mode check in zfp_internal_cuda_[de]compress (to set up the index as appropriate) also lets combining maxbits with maxprec / minexp work as expected in parallel on CUDA. My use of the API operates in a strictly bounded-memory environment where we've previously been limited to fixed-rate mode because of CUDA, so being able to move to variable rate with a bounded size is pretty exciting.
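For context, a sketch of the bounded-size variable-rate setup this enables; the parameter values are illustrative only, while zfp_stream_set_params and zfp_stream_set_execution are public zfp API calls:

```cuda
// Hedged sketch of the bounded-size variable-rate configuration described
// above. Parameter values are illustrative; the functions are the public
// zfp C API.
#include "zfp.h"

void configure_bounded_variable_rate(zfp_stream* zfp)
{
  // Expert-mode parameters: minbits, maxbits, maxprec, minexp.
  // maxbits caps the compressed size of each block, giving a hard bound on
  // buffer size while precision/accuracy may still vary per block.
  zfp_stream_set_params(zfp, 1u, 256u, 32u, -1074);

  // With the zfp_mode_expert clause mentioned above, the CUDA back end can
  // honor these bounds during variable-rate (de)compression.
  if (!zfp_stream_set_execution(zfp, zfp_exec_cuda))
    zfp_stream_set_execution(zfp, zfp_exec_serial);  // fall back to serial
}
```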

Completely understand re: the expected brokenness on the branch; I just wanted to see if I could get it to work, and it's honestly looking pretty good all around. Eager to see the next major release, and thanks for all the hard work!

@lindstro (Member)

The purpose of the stream_rseek call is to update the bitstream state in host memory so that when you return from zfp_decompress, the stream is positioned correctly for the next zfp_decompress call. No need to update any device data structures here.

Past experiments suggest that a granularity larger than 1, while helpful in reducing index size, is likely to degrade performance quite a bit. Also, it's incompatible with the index data structures needed for zfp's const_array classes, so it may not be available in the future. It definitely would be incompatible with some index representations we're considering.

There's no good reason I can think of why we would not support expert mode. Really the distinction should be between fixed and variable rate (and reversible mode, which uses a different algorithm), which is how we're distinguishing modes during compression. So I expect expert mode will be supported.
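A sketch of the fixed-vs-variable-rate distinction described here, using the public zfp_mode enum; the actual internal check in the CUDA back end may look different:

```cuda
// Hedged sketch of a fixed-vs-variable-rate dispatch using the public
// zfp_mode enum; the internal check in the CUDA back end may differ.
#include "zfp.h"

static int needs_block_index(zfp_stream* zfp)
{
  switch (zfp_stream_compression_mode(zfp)) {
    case zfp_mode_fixed_rate:
      return 0;  // fixed rate: block offsets are implicit, no index needed
    case zfp_mode_fixed_precision:
    case zfp_mode_fixed_accuracy:
    case zfp_mode_expert:      // expert mode may bound maxbits yet still vary
    case zfp_mode_reversible:  // reversible mode also yields variable-size blocks
      return 1;
    default:
      return -1;  // zfp_mode_null: invalid configuration
  }
}
```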

We'll hopefully get this cleaned up and working soon. While we certainly appreciate external contributions, I'm not sure it makes sense to merge this PR as we're already planning to rewrite major parts of this code.

