Add shuffle iterator implementation, tests, and example #7062

shwina · 2026-01-01T18:38:42Z

Description

This PR adds a ShuffleIterator implementation that uses a Feistel network based permutation function.

Note: the implementation is completely vibe-coded, but I was pleased to find that it more or less matches the libcudacxx approach.

Performance

As reported in cupy/cupy#9320, cp.random.choice(N) allocates a temporary of size N which can be slow and memory-inefficient when randomly drawing a small number of values from a large range. The stateless approach used by ShuffleIterator solves that problem:

import cupy as cp
import numpy as np

def run_cupy_choice(N: int, K: int, seed: int):
    out = cp.random.choice(N, size=K, replace=False)
    cp.cuda.runtime.deviceSynchronize()
    return out


def run_cuda_compute_choice(N: int, K: int, seed: int):
    from cuda.compute import ShuffleIterator, unary_transform
    shuffle_it = ShuffleIterator(N, seed)
    out = cp.empty(K, dtype=np.int64)
    unary_transform(shuffle_it, out, lambda x: x, K)
    cp.cuda.runtime.deviceSynchronize()
    return out

%timeit run_cupy_choice(100_000_000, 1_000, 42)
34.1 ms ± 16.3 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit run_cuda_compute_choice(100_000_000, 1_000, 42)
15.3 μs ± 112 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-01-01T18:38:46Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

shwina · 2026-01-01T18:39:01Z

/ok to test aefdebe

shwina · 2026-01-01T18:45:15Z

/ok to test 612b25c

RAMitchell

I can't really speak to CCCL and how its Python code is organised and tested, but here are some comments on the algorithm itself.

We should use the Philox hash that is developed and tested in the paper https://arxiv.org/abs/2106.06161
We need to statistically test the output distributions as this is a PRNG algorithm. Some ideas below.

For a very small permutation size (<5), count the number of times each permutation is generated. This output distribution should be uniform and can be tested with a chi-squared test using scipy.

Generate a bunch of permutations - take the value of each permutation at index 0. This sequence should be a uniform integer distribution. Test it using scipy, repeat for each index.

The paper also gives a stronger statistical test, but it's more complicated to implement as we need the Mallows kernel.
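A minimal sketch of the first two checks above, using only the standard library. `reference_permutation` is a hypothetical stand-in for the generator under test (here a seeded Fisher-Yates shuffle), and the helper names are illustrative, not part of this PR. The statistic is Pearson's chi-squared; `scipy.stats.chisquare(observed)` would additionally return a p-value.

```python
import itertools
import random
from collections import Counter


def chi_squared_statistic(observed, expected):
    """Pearson chi-squared statistic when every bin has the same expected count."""
    return sum((o - expected) ** 2 / expected for o in observed)


def permutation_count_statistic(perms, n):
    """Test 1: count how often each of the n! permutations occurs."""
    counts = Counter(tuple(p) for p in perms)
    all_perms = list(itertools.permutations(range(n)))
    observed = [counts[p] for p in all_perms]  # missing permutations count as 0
    return chi_squared_statistic(observed, len(perms) / len(all_perms))


def position_statistic(perms, n, index):
    """Test 2: count how often each value lands at a fixed index."""
    counts = Counter(p[index] for p in perms)
    observed = [counts[v] for v in range(n)]
    return chi_squared_statistic(observed, len(perms) / n)


def reference_permutation(n, seed):
    """Hypothetical stand-in for the generator under test."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm


# One permutation per seed, mirroring a (num_items, seed) style constructor.
perms = [reference_permutation(4, seed) for seed in range(24_000)]
perm_stat = permutation_count_statistic(perms, 4)
pos_stats = [position_statistic(perms, 4, i) for i in range(4)]
print(f"permutation counts: chi2 = {perm_stat:.1f} (23 degrees of freedom)")
print("per-index chi2 (3 degrees of freedom each):", [round(s, 1) for s in pos_stats])
```

For a uniform generator the statistic should sit near its degrees of freedom (23 for the 24 permutations of n=4, 3 for each index); values far above that indicate a skewed distribution.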

RAMitchell · 2026-01-05T12:32:09Z

python/cuda_cccl/cuda/compute/iterators/_shuffle_iterator.py

+
+    Limitations
+    -----------
+    - The resulting permutation is *not* uniformly sampled from all


This statement kind of feels like it might be correct, but I think it's nonsense.

RAMitchell · 2026-01-05T12:34:50Z

python/cuda_cccl/cuda/compute/iterators/_shuffle_iterator.py

+    - The resulting permutation is *not* uniformly sampled from all
+      ``num_items!`` permutations. It is drawn from a large, structured family
+      of permutations induced by the Feistel construction.
+    - Cycle-walking may apply the Feistel permutation more than once per element


I think the user doesn't care about the implementation detail of cycle-walking. One consequence of this so-called "cycle-walking" is that the worst-case runtime for generating an element is O(n), where n is the permutation length; it is just vanishingly unlikely. I am not sure we even need to mention this.
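For readers unfamiliar with the term, here is a toy illustration of cycle-walking over a balanced Feistel network. This is a sketch, not the code in this PR: the round function (a truncated BLAKE2b hash), the round keys, and all function names are assumptions chosen for clarity.

```python
import hashlib


def _round_function(value, round_key, bits):
    """Toy keyed hash truncated to `bits` bits (illustrative only)."""
    data = f"{round_key}:{value}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") & ((1 << bits) - 1)


def feistel_bijection(x, half_bits, round_keys):
    """Balanced Feistel network: a bijection on [0, 2**(2 * half_bits))."""
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    for key in round_keys:
        # Each round is invertible regardless of the round function.
        left, right = right, left ^ _round_function(right, key, half_bits)
    return (left << half_bits) | right


def shuffled_index(i, num_items, round_keys):
    """Map index i to a shuffled index, cycle-walking out-of-range values."""
    # Smallest half width whose domain 4**half_bits covers num_items.
    half_bits = 1
    while (1 << (2 * half_bits)) < num_items:
        half_bits += 1
    x = feistel_bijection(i, half_bits, round_keys)
    while x >= num_items:  # re-apply the bijection until the value is in range
        x = feistel_bijection(x, half_bits, round_keys)
    return x


keys = list(range(8))  # arbitrary round keys for the demonstration
perm = [shuffled_index(i, 10, keys) for i in range(10)]
print(perm)
```

The walk always terminates because the starting index lies on a cycle of the bijection that eventually returns to an in-range value; the O(n) worst case corresponds to a walk that visits most of the out-of-range values first.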

RAMitchell · 2026-01-05T12:40:06Z

python/cuda_cccl/cuda/compute/iterators/_shuffle_iterator.py

+    return (R << hb) | L
+
+
+def _splitmix64_host(x: int) -> int:


I think using splitmix is fine in theory, but we should use the VariablePhilox algorithm from this paper: https://arxiv.org/abs/2106.06161. This implementation needs to round the sequence length up to the nearest power of 4, whereas the one from the paper rounds up to the nearest power of 2. The paper's version is also tested.
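To make the power-of-4 versus power-of-2 difference concrete, the sketch below computes the domain size each rounding rule yields. It assumes the power-of-4 requirement comes from splitting the bits into two equal halves (so the bit count must be even); `domain_bits` is a hypothetical helper, not a function from either implementation.

```python
def domain_bits(num_items, balanced):
    """Bits in the smallest bijection domain that covers num_items.

    balanced=True forces an even bit count (power-of-4 domain);
    balanced=False allows any bit count (power-of-2 domain).
    """
    bits = max(1, (num_items - 1).bit_length())
    if balanced and bits % 2:
        bits += 1
    return bits


for n in (5, 33, 1_000, 100_000_000):
    b4, b2 = domain_bits(n, True), domain_bits(n, False)
    # domain / n bounds the mean number of bijection applications per element
    # when out-of-range values are cycle-walked.
    print(f"n={n}: 2**{b4} vs 2**{b2}, "
          f"applications per element at most {2**b4 / n:.2f} vs {2**b2 / n:.2f}")
```

A larger domain means more out-of-range values to discard, so the power-of-4 rule can cost up to twice as many bijection applications per element as the power-of-2 rule.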

RAMitchell · 2026-01-05T12:43:38Z

python/cuda_cccl/tests/compute/test_shuffle_iterator.py

+    result = d_output.get()
+
+    # Should be a valid permutation
+    assert len(set(result)) == num_items


Check the sorted permutation against cp.arange for a faster test.
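A host-side sketch of that check using plain lists; `is_valid_permutation` is a hypothetical helper. On the device, the analogous check would presumably compare `cp.sort(d_output)` with `cp.arange(num_items)`. The example also shows that a distinctness check on its own accepts out-of-range values.

```python
def is_valid_permutation(result, num_items):
    """True when result is a rearrangement of 0..num_items-1."""
    return sorted(result) == list(range(num_items))


good = [2, 0, 3, 1]
out_of_range = [2, 0, 7, 1]  # distinct values, but 7 is not a valid index

assert is_valid_permutation(good, 4)
assert len(set(out_of_range)) == 4  # distinctness alone passes
assert not is_valid_permutation(out_of_range, 4)
```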

shwina · 2026-01-05T21:20:16Z

Thanks for the great feedback, @RAMitchell.

I've updated the implementation to use the VariablePhilox algorithm as mentioned; and to match the libcu++ implementation, we do 24 Feistel rounds.

However, I still don't see strictly uniform distribution of permutations using the test you suggested. For n=4 (24 different permutations), and samples=24000 (expected 1000 samples per permutation), here's what the distribution looks like:

And as another case, for n=5, and samples=120000:

So, while "good enough for government work", it's not as rigorous as maybe you would like.

Now, I also noted that the C++ implementation suffers from a similar issue.

The validation scripts, along with instructions for how to run them are in this gist: https://gist.github.com/shwina/2508ed08bc7c257dcf1834dfa574d7e6

github-actions · 2026-01-06T03:22:19Z

😬 CI Workflow Results

🟥 Finished in 6h 41m: Pass: 79%/48 | Total: 2d 20h | Max: 6h 00m

See results here.

leofang · 2026-01-06T17:09:27Z

python/cuda_cccl/cuda/compute/iterators/_factories.py

    return make_permutation_iterator(values, indices)


+def ShuffleIterator(num_items, seed):


API design:

As a library developer, being able to pass my internal seed is great

As an end user, I sometimes don't want to think about it and would prefer if seed=None (default), cuda.compute generates a random seed for me internally.
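One possible shape for that default, as a sketch only: `resolve_seed` and `make_shuffle` are hypothetical names, and drawing the seed with `secrets.randbits` is an assumption rather than a decision made in this PR.

```python
import secrets


def resolve_seed(seed=None):
    """Return the caller's seed, or draw a fresh 64-bit one when seed is None."""
    if seed is None:
        return secrets.randbits(64)
    return int(seed)


def make_shuffle(num_items, seed=None):
    """Hypothetical wrapper: returns the arguments a shuffle iterator would receive."""
    return num_items, resolve_seed(seed)


print(make_shuffle(10, seed=42))  # library developer: explicit, reproducible seed
print(make_shuffle(10))           # end user: seed chosen internally
```

Keeping the explicit `seed` parameter preserves reproducibility for callers who manage their own RNG state.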

leofang · 2026-01-06T18:03:30Z

I can reproduce the observation in #7062 (comment) with cupy/cupy#9570 (re-implementation of @RAMitchell's thrust::shuffle_iterator), by reusing @shwina's validation.py and replacing this

shuffle_it = ShuffleIterator(n, seed)
cuda.compute.unary_transform(shuffle_it, d_perm, lambda x: x, n)

by this

keys = np.random.randint(
    0, 0xFFFFFFFF + 1, size=24, dtype=np.uint32)
bijection = FeistelBijection(n, keys)
d_perm = bijection(n)

Since we use different RNGs, this verifies that the observation is NOT due to the choice of RNG.

leofang · 2026-01-06T18:40:40Z

(updated my comment above)

RAMitchell · 2026-01-06T21:34:42Z

Thanks @leofang, I've done the same experiment and something is definitely not quite right, either with our Philox implementation or there are simply too few bits at low sizes for this hash to work as intended. I will do some more investigation.

RAMitchell · 2026-01-07T10:26:01Z

I've tried a number of different approaches today, including strengthening the hash functions in the Feistel cipher. In the end, the only thing that matters is the minimum size of the sequence. A minimum of 4 bits seems to be too low; increasing it to as few as 6 bits seems to give the cipher enough to work with, and it then passes uniformity tests. This means that we need to throw away more values for small sequences, but this seems OK.

shwina · 2026-01-07T13:09:09Z

@RAMitchell Thanks for taking a look. To be clear, does this mean we are OK with non-strict uniformity at lower sequence sizes, or do we need changes to the implementation at the C++ level?

Tangentially, I've realized that we don't strictly need to reimplement the functionality in Python. Similar to what we do in cuda.coop, it's possible to reuse the existing functionality provided in feistel_bijection.h from Python to implement ShuffleIterator. So any changes to the C++ implementation will be picked up here automatically. I'll push an update to the PR that uses this implementation soon.

RAMitchell · 2026-01-07T13:52:18Z

No, we can have uniformity. We just have to use e.g. a bijection for 256 elements to generate a permutation of length 5. The tiny bijections don't work well with the Feistel cipher.
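A sketch of that idea: derive a short permutation from a bijection on a larger, fixed-minimum domain and discard out-of-range values. The bijection here is a stand-in lookup table shuffled with the standard library, not a Feistel cipher, and `min_bits=8` simply mirrors the 256-element example; both are assumptions for illustration.

```python
import random


def permutation_from_larger_bijection(num_items, min_bits=8, seed=0):
    """Permutation of range(num_items) taken from a bijection on >= 2**min_bits values."""
    bits = max(min_bits, (num_items - 1).bit_length())
    domain = 1 << bits
    # Stand-in for a keyed bijection on [0, domain): a shuffled lookup table.
    table = list(range(domain))
    random.Random(seed).shuffle(table)
    perm = []
    for i in range(num_items):
        x = table[i]
        while x >= num_items:  # throw away out-of-range values
            x = table[x]
        perm.append(x)
    return perm


print(permutation_from_larger_bijection(5))
```

With a 256-element domain and `num_items=5`, most values produced by the bijection are discarded, which is the extra cost for small sequences mentioned earlier in the thread.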

shwina · 2026-01-07T13:56:33Z

OK, would you like to push a fix for that in a separate PR, or would you like me to do that in this one? (it will probably take me more cycles)

RAMitchell · 2026-01-07T14:31:29Z

I will do it in a separate PR, thanks. I would like to add a bunch more tests as well.

Add shuffle iterator implementation, tests, and example

aefdebe

github-project-automation bot added this to CCCL Jan 1, 2026

github-project-automation bot moved this to Todo in CCCL Jan 1, 2026

cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Jan 1, 2026

Use named constants

612b25c

shwina added 2 commits January 3, 2026 11:21

Correct caching for ShuffleIterator

5e2e574

Docs cleanup

110a9e0

shwina marked this pull request as ready for review January 5, 2026 11:34

shwina requested review from a team as code owners January 5, 2026 11:34

shwina requested review from NaderAlAwar and alliepiper January 5, 2026 11:34

cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Jan 5, 2026

RAMitchell reviewed Jan 5, 2026

gevtushenko linked an issue Jan 5, 2026 that may be closed by this pull request

cuda.cccl.parallel: Add shuffle_iterator #5684

Open

shwina added 2 commits January 5, 2026 20:34

Update shuffle iterator implementation to match C++ implementation (VariablePhilox)

0f91b36

Unnecessary commentary

4b17f52

shwina self-assigned this Jan 6, 2026

leofang reviewed Jan 6, 2026

leofang mentioned this pull request Jan 6, 2026

WIP: Avoid allocating an intermediate array in cupy.random.choice cupy/cupy#9570

Draft

RAMitchell mentioned this pull request Jan 8, 2026

Fixes for shuffle_iterator #7130

Merged
