Skip to content

Conversation

@shwina
Copy link
Contributor

@shwina shwina commented Jan 1, 2026

Description

This PR adds a ShuffleIterator implementation that uses a Feistel network based permutation function.

Note - the implementation is completely vibe-coded, but I was pleased to find that it more or less matches the libcudacxx approach .

Performance

As reported in cupy/cupy#9320, cp.random.choice(N) allocates a temporary of size N which can be slow and memory-inefficient when randomly drawing a small number of values from a large range. The stateless approach used by ShuffleIterator solves that problem:

import cupy as cp
import numpy as np

def run_cupy_choice(N: int, K: int, seed: int):
    out = cp.random.choice(N, size=K, replace=False)
    cp.cuda.runtime.deviceSynchronize()
    return out


def run_cuda_compute_choice(N: int, K: int, seed: int):
    from cuda.compute import ShuffleIterator, unary_transform
    shuffle_it = ShuffleIterator(N, seed)
    out = cp.empty(K, dtype=np.int64)
    unary_transform(shuffle_it, out, lambda x: x, K)
    cp.cuda.runtime.deviceSynchronize()
    return out
%timeit run_cupy_choice(100_000_000, 1_000, 42)
34.1 ms ± 16.3 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit run_cuda_compute_choice(100_000_000, 1_000, 42)
15.3 μs ± 112 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link
Contributor

copy-pr-bot bot commented Jan 1, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Jan 1, 2026
@shwina
Copy link
Contributor Author

shwina commented Jan 1, 2026

/ok to test aefdebe

@shwina
Copy link
Contributor Author

shwina commented Jan 1, 2026

/ok to test 612b25c

@github-actions

This comment has been minimized.

@shwina shwina marked this pull request as ready for review January 5, 2026 11:34
@shwina shwina requested review from a team as code owners January 5, 2026 11:34
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Jan 5, 2026
@github-actions

This comment has been minimized.

Copy link
Contributor

@RAMitchell RAMitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really speed to CCCL and how its python code is organised tested but some comments on the algorithm itself.

  • We should use the Philox hash that is developed and tested in the paper https://arxiv.org/abs/2106.06161
  • We need to statistically test the output distributions as this is a PRNG algorithm. Some ideas below.

For a very small permutation size (<5) count the number of times each permutation is generated. This output distrubition should be uniform and can be tested with a chi-squared test with scipy.

Generate a bunch of permutations - take the value of each permutation at index 0. This sequence should be a uniform integer distribution. Test it using scipy, repeat for each index.

The paper also gives a stronger statistical test but its more complicated to implement as we need the mallows kernel.

Limitations
-----------
- The resulting permutation is *not* uniformly sampled from all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement kind of feels like it might be correct but I think its nonsense.

- The resulting permutation is *not* uniformly sampled from all
``num_items!`` permutations. It is drawn from a large, structured family
of permutations induced by the Feistel construction.
- Cycle-walking may apply the Feistel permutation more than once per element
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the user doesn't care about the implementation detail of cycle walking. One consequence of this so called "cycle-walking" is the the worst case runtime for generating an element is O(n) where n is the permutation length - it is just vanishing unlikely. I am not sure we even need to mention this.

return (R << hb) | L


def _splitmix64_host(x: int) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using splitmix is in theory fine, but we should use the VariablePhilox algorithm from this paper: https://arxiv.org/abs/2106.06161. This implemention needs to round up the sequence to the nearest power of 4, the one from the paper is the nearest power of 2. And its tested.

result = d_output.get()

# Should be a valid permutation
assert len(set(result)) == num_items
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the sorted permutation against cp.arange for a faster test.

@gevtushenko gevtushenko linked an issue Jan 5, 2026 that may be closed by this pull request
@shwina
Copy link
Contributor Author

shwina commented Jan 5, 2026

Thanks for the great feedback, @RAMitchell.

I've updated the implementation to use the VariablePhilox algorithm as mentioned; and to match the libcu++ implementation, we do 24 Feistel rounds.

However, I still don't see strictly uniform distribution of permutations using the test you suggested. For n=4 (24 different permutations), and samples=24000 (expected 1000 samples per permutation), here's what the distribution looks like:

Screenshot 2026-01-05 at 3 24 18 PM

And as another case, for n=5, and samples=120000:

Screenshot 2026-01-05 at 3 43 14 PM

So, while "good enough for government work", it's not as rigorous as maybe you would like.

Now, I also noted that the C++ implementation suffers from a similar issue.

The validation scripts, along with instructions for how to run them are in this gist: https://gist.github.com/shwina/2508ed08bc7c257dcf1834dfa574d7e6

@github-actions
Copy link
Contributor

github-actions bot commented Jan 6, 2026

😬 CI Workflow Results

🟥 Finished in 6h 41m: Pass: 79%/48 | Total: 2d 20h | Max: 6h 00m

See results here.

@shwina shwina self-assigned this Jan 6, 2026
return make_permutation_iterator(values, indices)


def ShuffleIterator(num_items, seed):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API design:

  • As a library developer, being able to pass my internal seed is great
  • As an end user, I sometimes don't want to think about it and would prefer if seed=None (default), cuda.compute generates a random seed for me internally.

@leofang
Copy link
Member

leofang commented Jan 6, 2026

I can reproduce the observation in #7062 (comment) with cupy/cupy#9570 (re-implementation of @RAMitchell's thrust::shuffle_iterator), by reusing @shwina's validation.py and replacing this

shuffle_it = ShuffleIterator(n, seed)
cuda.compute.unary_transform(shuffle_it, d_perm, lambda x: x, n)

by this

keys = np.random.randint(
    0, 0xFFFFFFFF + 1, size=24, dtype=np.uint32)
bijection = FeistelBijection(n, keys)
d_perm = bijection(n)

Since we use different RNGs, this verifies that the observation is NOT due to the choice of RNG.

@leofang
Copy link
Member

leofang commented Jan 6, 2026

(updated my comment above)

@RAMitchell
Copy link
Contributor

Thanks @leofang, I've done the same experiment and something is definitely not quite right, either with our Philox implementation or there are simply too few bits at low sizes for this hash to work as intended. I will do some more investigation.

@RAMitchell
Copy link
Contributor

I've tried a number of different approaches today, including strengthening the hash functions in the feistel cipher. The thing that matters in the end is just the minimum size of the sequence. 4 bits as a minimum seems to be too low, increasing it to as few as 6 bits seems to give the cipher enough to work with and passes uniformity tests. This means that we need to throw away more values for small sequences but this seems ok.

@shwina
Copy link
Contributor Author

shwina commented Jan 7, 2026

@RAMitchell Thanks for taking a look. To be clear, does this mean we are OK with non strict-uniformity at lower sequence sizes; or do we need changes to the implementation at the C++ level?

Tangentially, I've realized that we don't strictly need to reimplement the functionality in Python. Similar to what we do in cuda.coop, it's possible to reuse the existing functionality provided in feistel_bijection.h from Python to implement ShuffleIterator. So any changes to the C++ implementation will be picked up here automatically. I'll push an update to the PR that uses this implementation soon.

@RAMitchell
Copy link
Contributor

No, we can have uniformity. We just have to use e.g. a bijection for 256 elements to generate a permutation of length 5. The tiny bijections don't work well with the feistel cipher.

@shwina
Copy link
Contributor Author

shwina commented Jan 7, 2026

OK, would you like to push a fix for that in a separate PR, or would you like me to do that in this one? (it will probably take me more cycles)

@RAMitchell
Copy link
Contributor

I will do it in a separate PR thanks. I would like to add a bunch more tests as well.

@alliepiper alliepiper removed their request for review January 14, 2026 00:18
Copy link
Contributor

@djns99 djns99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've realized that we don't strictly need to reimplement the functionality in Python

Yes please try to reuse the C++ implementation so we don't have to duplicate the correctness tests in python.
C++ supports custom bijections but I don't think we would need to expose that functionality here

cp.testing.assert_array_equal(d_output1, d_output2)


def test_shuffle_iterator_different_seeds():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some extra tests I would consider:

  1. Check the result != identity permutation
  2. Check a few different seeds and assert every value is moved in at least some of them

Lets have at least a couple of basic uniformity test:

  • Check all permutations of 5-6 elements is uniformly selected (the only true test)
  • Pick one element and check it lands at every location uniformly (ok sanity check for long sequences)

Otherwise lets reuse C++ and rely on their tests

@djns99
Copy link
Contributor

djns99 commented Jan 20, 2026

However, I still don't see strictly uniform distribution of permutations using the test you suggested.

I'll note that we don't expect to see a perfectly flat line, we should see a slight S shape. But over 100 difference is definitely high. e.g this is a plot of randomly generated values between 0, 120

random_plot

@shwina shwina marked this pull request as draft January 20, 2026 15:43
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Review to In Progress in CCCL Jan 20, 2026
@shwina
Copy link
Contributor Author

shwina commented Jan 20, 2026

Thank you @djns99 - I can confirm I do see better S shaped curves when reusing the (new, fixed) underlying C++ implementation, with less variance.

I will mark this PR as draft, as I'd like to land some other bits of functionality that will allow us to better reuse the underlying C++ implementation first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

cuda.cccl.parallel: Add shuffle_iterator

4 participants