
ENH: vectorize cramervonmises_2samp #58

Open
wants to merge 1 commit into base: perm_ttest_efficiency

Conversation

jb-leger

@mdhaber: An improvement of the Cramér-von Mises test through vectorization. I thought of this vectorization while reviewing your PR. This code depends on _all_partitions, so I am submitting the PR against your branch. If you prefer, I can wait for the merge into scipy/master and then propose a PR directly on scipy (in that case, please close this PR).
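For context, a minimal sketch of the vectorization idea: stack every partition of the pooled sample into 2D index arrays and evaluate the statistic for all partitions at once via broadcasting. This is only an illustration (it rebuilds the partitions with itertools.combinations instead of using the _all_partitions helper from your PR, and it assumes no ties), not the code in this PR:

# Illustrative sketch only (not the PR code): compute the exact two-sample
# CvM p-value by evaluating the statistic over all partitions at once.
import numpy as np
from itertools import combinations

def exact_cvm_2samp_pvalue(x, y):
    nx, ny = len(x), len(y)
    N = nx + ny
    ranks = np.argsort(np.argsort(np.concatenate((x, y)))) + 1  # ranks 1..N, no ties assumed

    def statistic(rx, ry):
        # u statistic (Anderson, 1962); rx/ry may be 1D or 2D rank arrays
        i, j = np.arange(1, nx + 1), np.arange(1, ny + 1)
        u = nx * np.sum((rx - i)**2, axis=-1) + ny * np.sum((ry - j)**2, axis=-1)
        return u / (nx*ny*N) - (4*nx*ny - 1) / (6*N)

    t_obs = statistic(np.sort(ranks[:nx]), np.sort(ranks[nx:]))

    # every way to assign nx of the N pooled observations to the first sample
    idx_x = np.array(list(combinations(range(N), nx)))           # shape (C, nx)
    mask = np.ones((len(idx_x), N), dtype=bool)
    np.put_along_axis(mask, idx_x, False, axis=1)
    idx_y = np.nonzero(mask)[1].reshape(len(idx_x), ny)          # shape (C, ny)

    # statistic for all partitions at once; p-value is the fraction at least
    # as large as the observed statistic
    t_all = statistic(np.sort(ranks[idx_x], axis=1), np.sort(ranks[idx_y], axis=1))
    return np.mean(t_all >= t_obs)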

Before vectorization:

In [1]: import numpy as np
   ...: import scipy.stats as stats
   ...: np.random.seed(0)
   ...: a = np.random.uniform(size=10)
   ...: b = np.random.uniform(size=11)

In [2]: stats.cramervonmises_2samp(a, b, method='exact')
Out[2]: CramerVonMisesResult(statistic=0.12236652236652246, pvalue=0.5228767620408488)

In [3]: %timeit stats.cramervonmises_2samp(a, b, method='exact')
4.79 s ± 88.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After vectorization:

In [1]: import numpy as np
   ...: import scipy.stats as stats
   ...: np.random.seed(0)
   ...: a = np.random.uniform(size=10)
   ...: b = np.random.uniform(size=11)

In [2]: stats.cramervonmises_2samp(a, b, method='exact')
Out[2]: CramerVonMisesResult(statistic=0.12236652236652246, pvalue=0.5228767620408488)

In [3]: %timeit stats.cramervonmises_2samp(a, b, method='exact')
1.67 s ± 25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With this improvement, maybe the threshold for switching from the exact method to the asymptotic method with method="auto" (currently 10 for the sum nx + ny) can be adjusted.

@mdhaber (Owner) commented Mar 10, 2021

I think it's a good idea. If we were to go this route, I would consider changing _all_partitions from a generator function to a function that returns a 2D array containing all the partitions in one shot. That would encourage future tests to vectorize these sorts of computations, too.
(The only reason I made it a generator function in scipygh-13661 is because cramervonmises_2samp wasn't vectorized, and I didn't want to mess with that at the time.)
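For illustration, the shape of that change might look roughly like this (a simplified sketch with hypothetical names; scipy's actual _all_partitions also yields the complementary index set):

# Simplified sketch of the suggestion: one-shot 2D array instead of a generator.
import numpy as np
from itertools import combinations

def _all_partitions_gen(nx, ny):
    # current style: yield one index set at a time
    for c in combinations(range(nx + ny), nx):
        yield np.array(c)

def _all_partitions_array(nx, ny):
    # suggested style: return all C(nx+ny, nx) index sets at once, shape (C, nx)
    return np.array(list(combinations(range(nx + ny), nx)))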

On the other hand, since it's just based on ranks, I think there are much more efficient methods for calculating the exact distribution of the statistic under the null hypothesis (e.g. using recursion), if you're interested. If you go that route, I'd recommend implementing the distribution in a class.
scipygh-12531 shows a non-vectorized example. See _PageL.
scipygh-4933 is vectorized as best as I could. See _MWU.
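To illustrate the "distribution in a class" pattern: the following is not scipy's _MWU, just a sketch of the same idea applied to the Mann-Whitney U null distribution, whose counting recurrence N(u; m, n) = N(u - n; m - 1, n) + N(u; m, n - 1) is standard.

# Sketch of the pattern only: build an exact null distribution by recursion
# inside a class, rather than by enumerating permutations.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def _mwu_count(u, m, n):
    # number of rank configurations of m x's and n y's giving statistic u
    if u < 0:
        return 0
    if m == 0 or n == 0:
        return 1 if u == 0 else 0
    return _mwu_count(u - n, m - 1, n) + _mwu_count(u, m, n - 1)

class _MWUExampleDist:
    """Exact Mann-Whitney U null distribution built by recursion."""
    def __init__(self, m, n):
        self.m, self.n, self.total = m, n, comb(m + n, m)

    def pmf(self, u):
        return _mwu_count(u, self.m, self.n) / self.total

    def cdf(self, u):
        return sum(self.pmf(k) for k in range(u + 1))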

Thanks for thinking of this!

@jb-leger (Author)

> I think it's a good idea. If we were to go this route, I would consider changing _all_partitions from a generator function to a function that returns a 2D array containing all the partitions in one shot. That would encourage future tests to vectorize these sorts of computations, too.

In fact, generating an array with pythran can be very efficient, without a huge cost.

Let's see:

  • meth1.py, which is basically what we are doing now,
  • meth2.py, which is pure-Python (but pythran-compatible) code, not compiled with pythran (a rough sketch of such a function follows the timings below),
  • meth3.py, which is just a symlink to meth2.py, but compiled with pythran.
In [1]: import meth1, meth2, meth3                                                                               

In [2]: m1 = meth1.get_indices_permutation_test(21, 10) 
   ...: m2 = meth2.get_indices_permutation_test(21, 10) 
   ...: m3 = meth3.get_indices_permutation_test(21, 10)                                                          

In [3]: (m1==m2).all()                                                                                           
Out[3]: True

In [4]: (m1==m3).all()                                                                                           
Out[4]: True

In [5]: %timeit meth1.get_indices_permutation_test(21, 10)                                                       
4.39 s ± 89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit meth2.get_indices_permutation_test(21, 10)                                                       
2.66 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit meth3.get_indices_permutation_test(21, 10)                                                       
47.5 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It is a quick and dirty experiment. If this works for you, I will do the same for the permutations (for the random samples used when the performed test is not exact).
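For reference, here is a rough guess at what a pythran-compatible meth2.py might look like; the function name and signature come from the session above, but the body is an assumption, not the actual file:

# meth2.py -- a guess at a pythran-compatible, pure-Python implementation
# (explicit loops only; compiling this same file with pythran gives meth3).
import numpy as np

# pythran export get_indices_permutation_test(int, int)
def get_indices_permutation_test(n, k):
    """Return a (C(n, k), k) array listing every k-subset of range(n)
    in lexicographic order."""
    # number of combinations C(n, k), computed exactly with integers
    c = 1
    for i in range(k):
        c = c * (n - i) // (i + 1)
    out = np.empty((c, k), dtype=np.int64)
    comb = list(range(k))                       # current combination
    for row in range(c):
        for j in range(k):
            out[row, j] = comb[j]
        # advance to the next combination in lexicographic order
        i = k - 1
        while i >= 0 and comb[i] == i + n - k:
            i -= 1
        if i >= 0:
            comb[i] += 1
            for j in range(i + 1, k):
                comb[j] = comb[j - 1] + 1
    return out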

@jb-leger (Author)

> On the other hand, since it's just based on ranks, I think there are much more efficient methods for calculating the exact distribution of the statistic under the null hypothesis (e.g. using recursion),

I don't know whether there is a recursion formula for this distribution. Do you have any results on this?

@mdhaber (Owner) commented Mar 10, 2021

> It is a quick and dirty experiment. If this works for you, I will do the same for the permutations (for the random samples used when the performed test is not exact).

Yeah! That would be great.

> I don't know whether there is a recursion formula for this distribution. Do you have any results on this?

I haven't looked. We can leave that to the future. Compilation will be useful for the random permutation tests, so in that case it makes sense to compile both.

Let's wait until scipygh-13661 is merged, which will need to wait for another stats maintainer. After that, if you submit a PR with these improvements, I can review and merge it.

In that case, I'm going to leave the code alone in scipygh-13661 - not replace it with the set-difference technique - because I think it will speed up review. I didn't write that mask-based code; I just moved it from cramervonmises_2samp, and I think it is easier to review if the code hasn't changed.
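(For context, a tiny generic illustration of the two equivalent ways to compute the complementary indices; this is not the scipygh-13661 code, and the array values are made up.)

import numpy as np

N = 7
idx_x = np.array([0, 2, 5])

# mask-based: flip off the chosen indices and keep the rest
mask = np.ones(N, dtype=bool)
mask[idx_x] = False
idx_y_mask = np.flatnonzero(mask)                 # [1, 3, 4, 6]

# set-difference technique
idx_y_diff = np.setdiff1d(np.arange(N), idx_x)    # [1, 3, 4, 6]

assert (idx_y_mask == idx_y_diff).all()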

@mdhaber (Owner) commented Mar 11, 2021

Looks like there is some information about generating the CVM distribution efficiently here.
