
ENH: vectorize cramervonmises_2samp #58

Open
wants to merge 1 commit into base: perm_ttest_efficiency

Conversation

jb-leger

@mdhaber: An improvement of the Cramér-von Mises test through vectorization. I thought of this vectorization while reviewing your PR. This code depends on _all_partitions, so I am submitting the PR against your branch. If you prefer, I can wait for the merge into scipy/master and then propose a PR directly on scipy (in that case, please close this PR).
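For context, a minimal sketch of the vectorization idea: stack every partition of the pooled sample into 2D index arrays and evaluate the statistic for all partitions at once via broadcasting. This is only an illustration (it rebuilds the partitions with itertools.combinations instead of using the _all_partitions helper from your PR, and it assumes no ties), not the code in this PR:

# Illustrative sketch only (not the PR code): compute the exact two-sample
# CvM p-value by evaluating the statistic over all partitions at once.
import numpy as np
from itertools import combinations

def exact_cvm_2samp_pvalue(x, y):
    nx, ny = len(x), len(y)
    N = nx + ny
    ranks = np.argsort(np.argsort(np.concatenate((x, y)))) + 1  # ranks 1..N, no ties assumed

    def statistic(rx, ry):
        # u statistic (Anderson, 1962); rx/ry may be 1D or 2D rank arrays
        i, j = np.arange(1, nx + 1), np.arange(1, ny + 1)
        u = nx * np.sum((rx - i)**2, axis=-1) + ny * np.sum((ry - j)**2, axis=-1)
        return u / (nx*ny*N) - (4*nx*ny - 1) / (6*N)

    t_obs = statistic(np.sort(ranks[:nx]), np.sort(ranks[nx:]))

    # every way to assign nx of the N pooled observations to the first sample
    idx_x = np.array(list(combinations(range(N), nx)))           # shape (C, nx)
    mask = np.ones((len(idx_x), N), dtype=bool)
    np.put_along_axis(mask, idx_x, False, axis=1)
    idx_y = np.nonzero(mask)[1].reshape(len(idx_x), ny)          # shape (C, ny)

    # statistic for all partitions at once; p-value is the fraction at least
    # as large as the observed statistic
    t_all = statistic(np.sort(ranks[idx_x], axis=1), np.sort(ranks[idx_y], axis=1))
    return np.mean(t_all >= t_obs)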

Before vectorization:

In [1]: import numpy as np
   ...: import scipy.stats as stats
   ...: np.random.seed(0)
   ...: a = np.random.uniform(size=10)
   ...: b = np.random.uniform(size=11)

In [2]: stats.cramervonmises_2samp(a, b, method='exact')
Out[2]: CramerVonMisesResult(statistic=0.12236652236652246, pvalue=0.5228767620408488)

In [3]: %timeit stats.cramervonmises_2samp(a, b, method='exact')
4.79 s ± 88.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After vectorization:

In [1]: import numpy as np
   ...: import scipy.stats as stats
   ...: np.random.seed(0)
   ...: a = np.random.uniform(size=10)
   ...: b = np.random.uniform(size=11)

In [2]: stats.cramervonmises_2samp(a, b, method='exact')
Out[2]: CramerVonMisesResult(statistic=0.12236652236652246, pvalue=0.5228767620408488)

In [3]: %timeit stats.cramervonmises_2samp(a, b, method='exact')
1.67 s ± 25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With this improvement, maybe the threshold for switching from the exact method to the asymptotic method with method="auto" (currently 10 for the sum nx + ny) can be adjusted.

@mdhaber (Owner) commented Mar 10, 2021

I think it's a good idea. If we were to go this route, I would consider changing _all_partitions from a generator function to a function that returns a 2D array containing all the partitions in one shot. That would encourage future tests to vectorize these sorts of computations, too.
(The only reason I made it a generator function in scipygh-13661 is because cramervonmises_2samp wasn't vectorized, and I didn't want to mess with that at the time.)
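For illustration, the shape of that change might look roughly like this (a simplified sketch with hypothetical names; scipy's actual _all_partitions also yields the complementary index set):

# Simplified sketch of the suggestion: one-shot 2D array instead of a generator.
import numpy as np
from itertools import combinations

def _all_partitions_gen(nx, ny):
    # current style: yield one index set at a time
    for c in combinations(range(nx + ny), nx):
        yield np.array(c)

def _all_partitions_array(nx, ny):
    # suggested style: return all C(nx+ny, nx) index sets at once, shape (C, nx)
    return np.array(list(combinations(range(nx + ny), nx)))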

On the other hand, since it's just based on ranks, I think there are much more efficient methods for calculating the exact distribution of the statistic under the null hypothesis (e.g. using recursion), if you're interested. If you go that route, I'd recommend implementing the distribution in a class.
scipygh-12531 shows a non-vectorized example. See _PageL.
scipygh-4933 is vectorized as best as I could. See _MWU.
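To illustrate the "distribution in a class" pattern: the following is not scipy's _MWU, just a sketch of the same idea applied to the Mann-Whitney U null distribution, whose counting recurrence N(u; m, n) = N(u - n; m - 1, n) + N(u; m, n - 1) is standard.

# Sketch of the pattern only: build an exact null distribution by recursion
# inside a class, rather than by enumerating permutations.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def _mwu_count(u, m, n):
    # number of rank configurations of m x's and n y's giving statistic u
    if u < 0:
        return 0
    if m == 0 or n == 0:
        return 1 if u == 0 else 0
    return _mwu_count(u - n, m - 1, n) + _mwu_count(u, m, n - 1)

class _MWUExampleDist:
    """Exact Mann-Whitney U null distribution built by recursion."""
    def __init__(self, m, n):
        self.m, self.n, self.total = m, n, comb(m + n, m)

    def pmf(self, u):
        return _mwu_count(u, self.m, self.n) / self.total

    def cdf(self, u):
        return sum(self.pmf(k) for k in range(u + 1))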

Thanks for thinking of this!

@jb-leger (Author)

> I think it's a good idea. If we were to go this route, I would consider changing _all_partitions from a generator function to a function that returns a 2D array containing all the partitions in one shot. That would encourage future tests to vectorize these sorts of computations, too.

In fact, generating an array with pythran can be very efficient, without a huge cost.

Let's see:

  • meth1.py, which is basically what we are doing now,
  • meth2.py, which is pure-Python (but pythran-compatible) code, not compiled with pythran (a rough sketch of such a function follows the timings below),
  • meth3.py, which is just a symlink to meth2.py, but compiled with pythran.
In [1]: import meth1, meth2, meth3                                                                               

In [2]: m1 = meth1.get_indices_permutation_test(21, 10) 
   ...: m2 = meth2.get_indices_permutation_test(21, 10) 
   ...: m3 = meth3.get_indices_permutation_test(21, 10)                                                          

In [3]: (m1==m2).all()                                                                                           
Out[3]: True

In [4]: (m1==m3).all()                                                                                           
Out[4]: True

In [5]: %timeit meth1.get_indices_permutation_test(21, 10)                                                       
4.39 s ± 89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit meth2.get_indices_permutation_test(21, 10)                                                       
2.66 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit meth3.get_indices_permutation_test(21, 10)                                                       
47.5 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It is a quick and dirty experiment. If this works for you, I will do the same for the permutations (for the random samples used when the performed test is not exact).
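For reference, here is a rough guess at what a pythran-compatible meth2.py might look like; the function name and signature come from the session above, but the body is an assumption, not the actual file:

# meth2.py -- a guess at a pythran-compatible, pure-Python implementation
# (explicit loops only; compiling this same file with pythran gives meth3).
import numpy as np

# pythran export get_indices_permutation_test(int, int)
def get_indices_permutation_test(n, k):
    """Return a (C(n, k), k) array listing every k-subset of range(n)
    in lexicographic order."""
    # number of combinations C(n, k), computed exactly with integers
    c = 1
    for i in range(k):
        c = c * (n - i) // (i + 1)
    out = np.empty((c, k), dtype=np.int64)
    comb = list(range(k))                       # current combination
    for row in range(c):
        for j in range(k):
            out[row, j] = comb[j]
        # advance to the next combination in lexicographic order
        i = k - 1
        while i >= 0 and comb[i] == i + n - k:
            i -= 1
        if i >= 0:
            comb[i] += 1
            for j in range(i + 1, k):
                comb[j] = comb[j - 1] + 1
    return out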

@jb-leger (Author)

> On the other hand, since it's just based on ranks, I think there are much more efficient methods for calculating the exact distribution of the statistic under the null hypothesis (e.g. using recursion),

I don't know whether there is a recursion formula for this distribution. Do you have any results on this?

@mdhaber (Owner) commented Mar 10, 2021

> It is a quick and dirty experiment. If this works for you, I will do the same for the permutations (for the random samples used when the performed test is not exact).

Yeah! That would be great.

> I don't know whether there is a recursion formula for this distribution. Do you have any results on this?

I haven't looked. We can leave that to the future. Compilation will be useful for the random permutation tests, so in that case it makes sense to compile both.

Let's wait until scipygh-13661 is merged, which will need to wait for another stats maintainer. After that, if you submit a PR with these improvements, I can review and merge it.

In that case, I'm going to leave the code alone in scipygh-13661 - not replace it with the set-difference technique - because I think it will speed up review. I didn't write that mask-based code; I just moved it from cramervonmises_2samp, and I think it is easier to review if the code hasn't changed.
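(For context, a tiny generic illustration of the two equivalent ways to compute the complementary indices; this is not the scipygh-13661 code, and the array values are made up.)

import numpy as np

N = 7
idx_x = np.array([0, 2, 5])

# mask-based: flip off the chosen indices and keep the rest
mask = np.ones(N, dtype=bool)
mask[idx_x] = False
idx_y_mask = np.flatnonzero(mask)                 # [1, 3, 4, 6]

# set-difference technique
idx_y_diff = np.setdiff1d(np.arange(N), idx_x)    # [1, 3, 4, 6]

assert (idx_y_mask == idx_y_diff).all()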

@mdhaber (Owner) commented Mar 11, 2021

Looks like there is some information about generating the CVM distribution efficiently here.
