Skip to content

Conversation

@shwina
Copy link
Contributor

@shwina shwina commented Jan 1, 2026

Description

This PR adds a ShuffleIterator implementation that uses a Feistel network based permutation function.

Note - the implementation is completely vibe-coded, but I was pleased to find that it more or less matches the libcudacxx approach .

Performance

As reported in cupy/cupy#9320, cp.random.choice(N) allocates a temporary of size N which can be slow and memory-inefficient when randomly drawing a small number of values from a large range. The stateless approach used by ShuffleIterator solves that problem:

import cupy as cp
import numpy as np

def run_cupy_choice(N: int, K: int, seed: int):
    out = cp.random.choice(N, size=K, replace=False)
    cp.cuda.runtime.deviceSynchronize()
    return out


def run_cuda_compute_choice(N: int, K: int, seed: int):
    from cuda.compute import ShuffleIterator, unary_transform
    shuffle_it = ShuffleIterator(N, seed)
    out = cp.empty(K, dtype=np.int64)
    unary_transform(shuffle_it, out, lambda x: x, K)
    cp.cuda.runtime.deviceSynchronize()
    return out
%timeit run_cupy_choice(100_000_000, 1_000, 42)
34.1 ms ± 16.3 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit run_cuda_compute_choice(100_000_000, 1_000, 42)
15.3 μs ± 112 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link
Contributor

copy-pr-bot bot commented Jan 1, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Jan 1, 2026
@shwina
Copy link
Contributor Author

shwina commented Jan 1, 2026

/ok to test aefdebe

@shwina
Copy link
Contributor Author

shwina commented Jan 1, 2026

/ok to test 612b25c

@github-actions

This comment has been minimized.

@shwina shwina marked this pull request as ready for review January 5, 2026 11:34
@shwina shwina requested review from a team as code owners January 5, 2026 11:34
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Jan 5, 2026
@github-actions

This comment has been minimized.

Copy link
Contributor

@RAMitchell RAMitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really speed to CCCL and how its python code is organised tested but some comments on the algorithm itself.

  • We should use the Philox hash that is developed and tested in the paper https://arxiv.org/abs/2106.06161
  • We need to statistically test the output distributions as this is a PRNG algorithm. Some ideas below.

For a very small permutation size (<5) count the number of times each permutation is generated. This output distrubition should be uniform and can be tested with a chi-squared test with scipy.

Generate a bunch of permutations - take the value of each permutation at index 0. This sequence should be a uniform integer distribution. Test it using scipy, repeat for each index.

The paper also gives a stronger statistical test but its more complicated to implement as we need the mallows kernel.

Limitations
-----------
- The resulting permutation is *not* uniformly sampled from all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement kind of feels like it might be correct but I think its nonsense.

- The resulting permutation is *not* uniformly sampled from all
``num_items!`` permutations. It is drawn from a large, structured family
of permutations induced by the Feistel construction.
- Cycle-walking may apply the Feistel permutation more than once per element
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the user doesn't care about the implementation detail of cycle walking. One consequence of this so called "cycle-walking" is the the worst case runtime for generating an element is O(n) where n is the permutation length - it is just vanishing unlikely. I am not sure we even need to mention this.

return (R << hb) | L


def _splitmix64_host(x: int) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using splitmix is in theory fine, but we should use the VariablePhilox algorithm from this paper: https://arxiv.org/abs/2106.06161. This implemention needs to round up the sequence to the nearest power of 4, the one from the paper is the nearest power of 2. And its tested.

result = d_output.get()

# Should be a valid permutation
assert len(set(result)) == num_items
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the sorted permutation against cp.arange for a faster test.

@gevtushenko gevtushenko linked an issue Jan 5, 2026 that may be closed by this pull request
@shwina
Copy link
Contributor Author

shwina commented Jan 5, 2026

Thanks for the great feedback, @RAMitchell.

I've updated the implementation to use the VariablePhilox algorithm as mentioned; and to match the libcu++ implementation, we do 24 Feistel rounds.

However, I still don't see strictly uniform distribution of permutations using the test you suggested. For n=4 (24 different permutations), and samples=24000 (expected 1000 samples per permutation), here's what the distribution looks like:

Screenshot 2026-01-05 at 3 24 18 PM

And as another case, for n=5, and samples=120000:

Screenshot 2026-01-05 at 3 43 14 PM

So, while "good enough for government work", it's not as rigorous as maybe you would like.

Now, I also noted that the C++ implementation suffers from a similar issue.

The validation scripts, along with instructions for how to run them are in this gist: https://gist.github.com/shwina/2508ed08bc7c257dcf1834dfa574d7e6

@github-actions
Copy link
Contributor

github-actions bot commented Jan 6, 2026

😬 CI Workflow Results

🟥 Finished in 6h 41m: Pass: 79%/48 | Total: 2d 20h | Max: 6h 00m

See results here.

@shwina shwina self-assigned this Jan 6, 2026
return make_permutation_iterator(values, indices)


def ShuffleIterator(num_items, seed):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API design:

  • As a library developer, being able to pass my internal seed is great
  • As an end user, I sometimes don't want to think about it and would prefer if seed=None (default), cuda.compute generates a random seed for me internally.

@leofang
Copy link
Member

leofang commented Jan 6, 2026

I can reproduce the observation in #7062 (comment) with cupy/cupy#9570 (re-implementation of @RAMitchell's thrust::shuffle_iterator), by reusing @shwina's validation.py and replacing this

shuffle_it = ShuffleIterator(n, seed)
cuda.compute.unary_transform(shuffle_it, d_perm, lambda x: x, n)

by this

keys = np.random.randint(
    0, 0xFFFFFFFF + 1, size=24, dtype=np.uint32)
bijection = FeistelBijection(n, keys)
d_perm = bijection(n)

Since we use different RNGs, this verifies that the observation is NOT due to the choice of RNG.

@leofang
Copy link
Member

leofang commented Jan 6, 2026

(updated my comment above)

@RAMitchell
Copy link
Contributor

Thanks @leofang, I've done the same experiment and something is definitely not quite right, either with our Philox implementation or there are simply too few bits at low sizes for this hash to work as intended. I will do some more investigation.

@RAMitchell
Copy link
Contributor

I've tried a number of different approaches today, including strengthening the hash functions in the feistel cipher. The thing that matters in the end is just the minimum size of the sequence. 4 bits as a minimum seems to be too low, increasing it to as few as 6 bits seems to give the cipher enough to work with and passes uniformity tests. This means that we need to throw away more values for small sequences but this seems ok.

@shwina
Copy link
Contributor Author

shwina commented Jan 7, 2026

@RAMitchell Thanks for taking a look. To be clear, does this mean we are OK with non strict-uniformity at lower sequence sizes; or do we need changes to the implementation at the C++ level?

Tangentially, I've realized that we don't strictly need to reimplement the functionality in Python. Similar to what we do in cuda.coop, it's possible to reuse the existing functionality provided in feistel_bijection.h from Python to implement ShuffleIterator. So any changes to the C++ implementation will be picked up here automatically. I'll push an update to the PR that uses this implementation soon.

@RAMitchell
Copy link
Contributor

No, we can have uniformity. We just have to use e.g. a bijection for 256 elements to generate a permutation of length 5. The tiny bijections don't work well with the feistel cipher.

@shwina
Copy link
Contributor Author

shwina commented Jan 7, 2026

OK, would you like to push a fix for that in a separate PR, or would you like me to do that in this one? (it will probably take me more cycles)

@RAMitchell
Copy link
Contributor

I will do it in a separate PR thanks. I would like to add a bunch more tests as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

cuda.cccl.parallel: Add shuffle_iterator

3 participants