Add shuffle iterator implementation, tests, and example #7062
base: main
Conversation
> Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

/ok to test aefdebe

/ok to test 612b25c
RAMitchell left a comment:
I can't really speak to CCCL and how its Python code is organised and tested, but here are some comments on the algorithm itself.
- We should use the Philox hash that is developed and tested in the paper https://arxiv.org/abs/2106.06161
- We need to statistically test the output distributions, as this is a PRNG algorithm. Some ideas below:
  - For a very small permutation size (<5), count the number of times each permutation is generated. This output distribution should be uniform and can be tested with a chi-squared test with scipy.
  - Generate a bunch of permutations and take the value of each permutation at index 0. This sequence should be a uniform integer distribution. Test it using scipy, then repeat for each index.
  - The paper also gives a stronger statistical test, but it's more complicated to implement as we need the Mallows kernel.
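The chi-squared idea can be sketched as follows. This is only an illustration: it uses Python's built-in Fisher-Yates shuffle as a stand-in generator (the real test would substitute the Feistel/Philox-based permutation under scrutiny), with scipy's `chisquare` as suggested.

```python
# Sketch of the suggested chi-squared uniformity test. `generate_permutation`
# is a stand-in (Fisher-Yates via random.shuffle); in the real test it would
# be the Feistel/Philox-based generator being validated.
import itertools
import random
from collections import Counter

from scipy.stats import chisquare

def generate_permutation(n, rng):
    perm = list(range(n))
    rng.shuffle(perm)  # stand-in for the permutation generator under test
    return tuple(perm)

n = 4                # small enough that all n! = 24 permutations fit in a table
trials = 24_000      # ~1000 expected observations per permutation
rng = random.Random(0)
counts = Counter(generate_permutation(n, rng) for _ in range(trials))

# Observed frequency of every one of the n! permutations (0 if never seen).
observed = [counts.get(p, 0) for p in itertools.permutations(range(n))]
chi2, pvalue = chisquare(observed)
print(f"chi2 = {chi2:.1f}, p = {pvalue:.3f}")  # small p => evidence of non-uniformity
```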
```
Limitations
-----------
- The resulting permutation is *not* uniformly sampled from all
```
This statement kind of feels like it might be correct, but I think it's nonsense.
```
- The resulting permutation is *not* uniformly sampled from all
  ``num_items!`` permutations. It is drawn from a large, structured family
  of permutations induced by the Feistel construction.
- Cycle-walking may apply the Feistel permutation more than once per element
```
I think the user doesn't care about the implementation detail of cycle-walking. One consequence of this so-called "cycle-walking" is that the worst-case runtime for generating an element is O(n), where n is the permutation length; it is just vanishingly unlikely. I am not sure we even need to mention this.
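For context, cycle-walking can be illustrated with a toy example (names and the bijection below are illustrative, not the PR's actual code). The bijection acts on a power-of-two superset domain; out-of-range outputs are simply fed back into the bijection until the result lands in `[0, num_items)`. Expected iterations per element are O(1), which is why the O(n) worst case is so unlikely in practice.

```python
# Toy cycle-walking demo. The bijection acts on a superset domain of size
# 2**bits; cycle_walk restricts it to a bijection on [0, num_items).
def make_toy_bijection(bits, key):
    mask = (1 << bits) - 1
    def f(x):
        # Odd multiplier => invertible modulo 2**bits, so f is a bijection.
        return ((x ^ key) * 0x9E3779B1) & mask
    return f

def cycle_walk(index, num_items, bijection):
    x = bijection(index)
    while x >= num_items:     # walk the cycle until we land back in range
        x = bijection(x)
    return x

f = make_toy_bijection(bits=4, key=11)   # domain [0, 16) for num_items = 10
perm = [cycle_walk(i, 10, f) for i in range(10)]
print(perm)  # a permutation of 0..9
```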
```
    return (R << hb) | L


def _splitmix64_host(x: int) -> int:
```
I think using SplitMix is in theory fine, but we should use the VariablePhilox algorithm from this paper: https://arxiv.org/abs/2106.06161. This implementation needs to round the sequence up to the nearest power of 4; the one from the paper uses the nearest power of 2. And it's tested.
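For reference, a host-side SplitMix64 with the widely published constants would look something like this (a sketch, not necessarily the PR's exact code):

```python
MASK64 = (1 << 64) - 1

def splitmix64_host(x: int) -> int:
    # SplitMix64 mix with the standard published constants. Python ints are
    # arbitrary-precision, so each step is masked back to 64 bits.
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return (x ^ (x >> 31)) & MASK64
```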
```
    result = d_output.get()

    # Should be a valid permutation
    assert len(set(result)) == num_items
```
Check the sorted permutation against cp.arange for a faster test.
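The suggested check, sketched here with NumPy (the actual test would use the same-shaped CuPy calls, `cp.sort` / `cp.arange` / `cp.array_equal`):

```python
# NumPy sketch of the faster validity check: sort the output and compare it
# to arange, instead of building a Python set of num_items elements.
import numpy as np

num_items = 100_000
rng = np.random.default_rng(0)
result = rng.permutation(num_items)  # stand-in for the shuffled device output

assert np.array_equal(np.sort(result), np.arange(num_items))
```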
Thanks for the great feedback, @RAMitchell. I've updated the implementation to use the VariablePhilox algorithm as mentioned; and to match the libcu++ implementation, we do 24 Feistel rounds. However, I still don't see a strictly uniform distribution of permutations using the test you suggested. For …

And as another case, for …

So, while "good enough for government work", it's not as rigorous as maybe you would like. Now, I also noted that the C++ implementation suffers from a similar issue. The validation scripts, along with instructions for how to run them, are in this gist: https://gist.github.com/shwina/2508ed08bc7c257dcf1834dfa574d7e6
😬 CI Workflow Results: 🟥 Finished in 6h 41m. Pass: 79%/48 | Total: 2d 20h | Max: 6h 00m. See results here.
```
    return make_permutation_iterator(values, indices)


def ShuffleIterator(num_items, seed):
```
API design:
- As a library developer, being able to pass my internal `seed` is great.
- As an end user, I sometimes don't want to think about it and would prefer that with `seed=None` (the default), cuda.compute generates a random seed for me internally.
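A minimal sketch of that default (names here are hypothetical, not the actual cuda.compute API):

```python
import secrets

def resolve_seed(seed=None):
    # seed=None: draw a fresh random 64-bit seed for the user; an explicit
    # seed is passed through unchanged so runs stay reproducible.
    return secrets.randbits(64) if seed is None else int(seed)

# A hypothetical ShuffleIterator(num_items, seed=None) would call
# resolve_seed(seed) before deriving its Feistel round keys.
print(resolve_seed(42))   # 42
```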
I can reproduce the observation in #7062 (comment) with cupy/cupy#9570 (re-implementation of @RAMitchell's …) by replacing this

```python
shuffle_it = ShuffleIterator(n, seed)
cuda.compute.unary_transform(shuffle_it, d_perm, lambda x: x, n)
```

with this

```python
keys = np.random.randint(0, 0xFFFFFFFF + 1, size=24, dtype=np.uint32)
bijection = FeistelBijection(n, keys)
d_perm = bijection(n)
```

Since we use different RNGs, this verifies that the observation is NOT due to the choice of RNG.
(updated my comment above)
Thanks @leofang, I've done the same experiment and something is definitely not quite right: either with our Philox implementation, or there are simply too few bits at low sizes for this hash to work as intended. I will do some more investigation.
I've tried a number of different approaches today, including strengthening the hash functions in the Feistel cipher. The thing that matters in the end is just the minimum size of the sequence. Four bits as a minimum seems to be too low; increasing it to as few as 6 bits seems to give the cipher enough to work with and passes uniformity tests. This means that we need to throw away more values for small sequences, but this seems OK.
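A sketch of that fix, assuming power-of-two rounding with a floor of 6 bits (the rounding granularity and floor are illustrative; the actual implementation may differ, e.g. rounding to a power of 4):

```python
def domain_bits(num_items: int, min_bits: int = 6) -> int:
    # Round the bijection's domain up to a power of two, but never below
    # 2**min_bits: e.g. a permutation of length 5 is drawn by cycle-walking
    # a bijection over 64 (not 8) elements, giving the cipher enough bits
    # to mix. The cost is discarding more out-of-range values.
    return max((num_items - 1).bit_length(), min_bits)

print(domain_bits(5))     # 6  -> domain of 64 elements
print(domain_bits(1000))  # 10 -> domain of 1024 elements
```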
@RAMitchell Thanks for taking a look. To be clear, does this mean we are OK with non-strict uniformity at lower sequence sizes, or do we need changes to the implementation at the C++ level? Tangentially, I've realized that we don't strictly need to reimplement the functionality in Python. Similar to what we do in …
No, we can have uniformity. We just have to use e.g. a bijection for 256 elements to generate a permutation of length 5. The tiny bijections don't work well with the Feistel cipher.
OK, would you like to push a fix for that in a separate PR, or would you like me to do that in this one? (It will probably take me more cycles.)

I will do it in a separate PR, thanks. I would like to add a bunch more tests as well.


Description
This PR adds a `ShuffleIterator` implementation that uses a Feistel-network-based permutation function.

Note: the implementation is completely vibe-coded, but I was pleased to find that it more or less matches the libcudacxx approach.
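To illustrate the core idea, here is a toy balanced Feistel network (illustrative only; the actual implementation uses VariablePhilox round functions and 24 rounds). The key property is that the construction is invertible regardless of what the round function does, so it always yields a bijection, i.e. a permutation of the index domain.

```python
# Toy balanced Feistel network over 2*half_bits bits. Any round function
# gives a bijection, because each round (L, R) -> (R, L ^ F(R, k)) is
# trivially invertible.
def round_fn(r, key, mask):
    # Arbitrary mixing; the real code uses Philox-style round functions.
    return ((r * 0x9E3779B1) ^ key) & mask

def feistel_permute(x, keys, half_bits):
    mask = (1 << half_bits) - 1
    L, R = x >> half_bits, x & mask
    for k in keys:
        L, R = R, L ^ round_fn(R, k, mask)
    return (L << half_bits) | R

keys = [3, 141, 59, 26]                  # toy round keys
outs = [feistel_permute(x, keys, 4) for x in range(256)]
assert sorted(outs) == list(range(256))  # a bijection on the 8-bit domain
```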
Performance
As reported in cupy/cupy#9320, `cp.random.choice(N)` allocates a temporary of size `N`, which can be slow and memory-inefficient when randomly drawing a small number of values from a large range. The stateless approach used by `ShuffleIterator` solves that problem.

Checklist