BUG: Ensure einsum uses chunking (now that nditer doesn't) (numpy#28043)

* BUG: Ensure einsum uses chunking (now that nditer doesn't) The nditer refactor enabled the `GROWINNER` for reductions, this was not the case before (I am not sure about cases where the reduction is in an outer axis, but I don't think so). Either way, as the test says, chunking always improves the precision of large sorts if their mean is nonzero and sklearn noticed lower precision. We could only remove growinner if there is a reduction, but it seems like a 1% performance hit for the simplest (non-trivial) case where GROWINNER had done something before. * Refine test and add code comment
abhishek-iitmadras · Dec 20, 2024 · 9aa5cda · 9aa5cda
1 parent 031a890
commit 9aa5cda
Show file tree

Hide file tree

Showing 2 changed files with 31 additions and 1 deletion.
diff --git a/numpy/_core/src/multiarray/einsum.c.src b/numpy/_core/src/multiarray/einsum.c.src
@@ -1032,10 +1032,13 @@ PyArray_EinsteinSum(char *subscripts, npy_intp nop,
                     NPY_ITER_NBO|
                     NPY_ITER_ALIGNED|
                     NPY_ITER_ALLOCATE;
+    /*
+     * Note: We skip GROWINNER here because this gives a partially stable
+     * summation for float64.  Pairwise summation would be better.
+     */
     iter_flags = NPY_ITER_EXTERNAL_LOOP|
             NPY_ITER_BUFFERED|
             NPY_ITER_DELAY_BUFALLOC|
-            NPY_ITER_GROWINNER|
             NPY_ITER_REFS_OK|
             NPY_ITER_ZEROSIZE_OK;
     if (out != NULL) {

diff --git a/numpy/_core/tests/test_einsum.py b/numpy/_core/tests/test_einsum.py
@@ -1282,3 +1282,30 @@ def test_overlap():
     #gh-10080, out overlaps one of the operands
     c = np.einsum('ij,jk->ik', a, b, out=b)
     assert_equal(c, d)
+
+def test_einsum_chunking_precision():
+    """Most einsum operations are reductions and until NumPy 2.3 reductions
+    never (or almost never?) used the `GROWINNER` mechanism to increase the
+    inner loop size when no buffers are needed.
+    Because einsum reductions work roughly:
+
+        def inner(*inputs, out):
+            accumulate = 0
+            for vals in zip(*inputs):
+                accumulate += prod(vals)
+            out[0] += accumulate
+
+    Calling the inner-loop more often actually improves accuracy slightly
+    (same effect as pairwise summation but much less).
+    Without adding pairwise summation to the inner-loop it seems best to just
+    not use GROWINNER, a quick tests suggest that is maybe 1% slowdown for
+    the simplest `einsum("i,i->i", x, x)` case.
+
+    (It is not clear that we should guarantee precision to this extend.)
+    """
+    num = 1_000_000
+    value = 1. + np.finfo(np.float64).eps * 8196
+    res = np.einsum("i->", np.broadcast_to(np.array(value), num)) / num
+
+    # At with GROWINNER 11 decimals succeed (larger will be less)
+    assert_almost_equal(res, value, decimal=15)